
THE  UNIVERSITY  OF  ALBERTA 

A  FACTOR  ANALYTIC  ITEM  SELECTION  PROCEDURE 

BY 

MERLIN WALTER WAHLSTROM


A  THESIS 

SUBMITTED  TO  THE  FACULTY  OF  GRADUATE  STUDIES 
IN  PARTIAL  FULFILMENT  OF  THE  REQUIREMENTS  FOR  THE  DEGREE  OF 

DOCTOR  OF  PHILOSOPHY 


DEPARTMENT  OF  EDUCATIONAL  PSYCHOLOGY 


EDMONTON,  ALBERTA 
APRIL,  1967 




UNIVERSITY  OF  ALBERTA 


FACULTY  OF  GRADUATE  STUDIES 


The  undersigned  certify  that  they  have  read  and  recommend 
to  the  Faculty  of  Graduate  Studies  for  acceptance  a  thesis  entitled 
"A Factor Analytic Item-Selection Procedure" submitted by Merlin
Walter  Wahlstrom  in  partial  fulfilment  of  the  requirements  for  the 
degree  of  Doctor  of  Philosophy. 


ABSTRACT 


The  problem  investigated  was  concerned  with  developing  a 
theoretical  basis  for  an  item-selection  algorithm  using  factor 
analytic methods. After a test has been administered to a group
of subjects and criterion variables are obtained, an item-criterion
intercorrelation  matrix  is  calculated.  The  intercorrelation 
matrix  is  then  factor  analyzed  to  determine  the  factors  that  define 
the item and criterion space. A rotation of the resulting orthogonal
factor structure is applied to provide a final solution that
has  simple  structure  properties.  Each  factor  is  then  assigned  a 
relative  weight  by  the  test  constructor  to  determine  the  position 
of  a  hypothetical  goal  vector  in  the  factor  space.  The  goal  vector 
defines  the  desired  characteristics  of  the  test  to  be  constructed. 

Initially,  the  two  items  having  the  largest  correlations 
with  the  goal  vector  are  selected.  A  composite  vector  is  formed  by 
calculating  the  centroid  of  the  two  selected  items.  Prior  to 
selecting  the  next  item,  the  characteristics  for  the  item  to  be 
selected  are  defined  so  that  the  goal  vector  and  the  composite 
vector  are  nearly  collinear.  Each  additional  item  selected  has 
properties  that  are  the  best  approximation,  within  limits  of  the 
items available for selection, for producing collinearity. When
all  items  for  constructing  a  test  have  been  selected,  the  centroid 
of  the  k  selected  items  then  determines  the  location  of  the 
composite  vector. 
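
The selection steps described above can be illustrated with a minimal sketch in Python. It is not the algorithm of Appendix A. It assumes that each item is represented by its row of loadings on orthogonal factors, that the goal vector is the set of relative factor weights scaled to unit length, that the correlation of an item with the goal vector is their inner product, and that the next item is the one whose inclusion brings the centroid closest to collinearity with the goal vector (largest cosine). The function names and the loadings are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


def centroid(vectors):
    """Coordinate-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def select_items(loadings, weights, k):
    """Return the indices of k selected items (assumes k >= 2).

    loadings: one row of orthogonal factor loadings per item (assumed input).
    weights:  relative factor weights chosen by the test constructor.
    """
    length = math.sqrt(dot(weights, weights))
    goal = [w / length for w in weights]  # unit-length goal vector

    # Step 1: the two items with the largest correlations with the goal vector.
    ranked = sorted(range(len(loadings)),
                    key=lambda i: dot(loadings[i], goal), reverse=True)
    chosen = ranked[:2]

    # Step 2: add the item that makes the centroid most nearly collinear
    # with the goal vector, until k items have been selected.
    while len(chosen) < min(k, len(loadings)):
        remaining = [i for i in range(len(loadings)) if i not in chosen]
        best = max(
            remaining,
            key=lambda i: cosine(
                centroid([loadings[j] for j in chosen + [i]]), goal),
        )
        chosen.append(best)
    return chosen


# Hypothetical loadings of six items on two orthogonal factors.
loadings = [[0.8, 0.1], [0.6, 0.3], [0.1, 0.7],
            [0.3, 0.6], [0.4, 0.5], [0.1, 0.1]]
print(select_items(loadings, weights=[2.0, 1.0], k=3))  # -> [0, 1, 4]
```

In this example the third item chosen places the centroid at (0.6, 0.3), which lies on the goal direction (2, 1). Because collinearity is the only criterion in the sketch, an item with small loadings could be chosen if it happened to correct the direction of the centroid; any further selection criteria are outside the sketch.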

An estimate of the constructed test's validity is given by
the correlation between the composite vector and the goal vector.
Reliability  is  defined  as  the  proportion  of  variance  accounted 
for  by  a  test  vector.  Two  reliabilities  can  easily  be  obtained 
by  considering  the  item  clustering  about  either  the  goal  vector 
or  the  composite  vector.  Since  items  are  selected  according 
to  the  goal  vector's  characteristics,  a  meaningful  value  would 
be  in  terms  of  item  projections  on  the  goal  test.  However,  a 
truly  internal  consistency  estimate  of  reliability  is  obtained 
by  using  the  composite  test  vector  which  has  as  co-ordinates  the 
centroid  of  the  selected  items. 
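
A rough numerical reading of these two estimates is sketched below in Python. It is only one possible interpretation and should not be taken as the formulas developed in Chapter VI. It assumes that the correlation between the composite and goal vectors is the cosine of the angle between them in the common-factor space (unique variance is ignored), and that the variance accounted for by a test vector is the mean squared projection of the standardized selected items on that vector. All names and loadings are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def unit(u):
    """Scale a non-zero vector to unit length."""
    length = math.sqrt(dot(u, u))
    return [a / length for a in u]


def composite(items):
    """Centroid of the selected item vectors."""
    return [sum(column) / len(items) for column in zip(*items)]


def validity_estimate(items, goal):
    """Cosine between the composite vector and the goal vector (assumed reading)."""
    return dot(unit(composite(items)), unit(goal))


def variance_accounted_for(items, test_vector):
    """Mean squared projection of the items on a test vector (assumed reading).

    Each standardized item has unit total variance, so dividing by the
    number of items gives a proportion of the total item variance.
    """
    axis = unit(test_vector)
    return sum(dot(item, axis) ** 2 for item in items) / len(items)


# Hypothetical loadings of three selected items on two orthogonal factors.
selected = [[0.8, 0.1], [0.6, 0.3], [0.4, 0.5]]
goal = [2.0, 1.0]

print(round(validity_estimate(selected, goal), 4))
print(round(variance_accounted_for(selected, goal), 4))
print(round(variance_accounted_for(selected, composite(selected)), 4))
```

When the composite vector is collinear with the goal vector, as it is for these illustrative loadings, the two variance figures coincide; they differ to the extent that the selected items fail to reproduce the goal direction.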

The  procedure  for  selecting  items  is  not  intended  to 
replace  existing  item  analysis  methods  but  rather  extends  the 
analytic  approach  of  the  test  constructor.  In  the  proposed 
method,  primary  consideration  has  been  given  to  meaningfulness 
and  practicality  of  application.  With  electronic  machines  to 
handle  the  major  part  of  selecting  items,  effort  on  tedious 
nonprofitable tasks should be reduced.




ACKNOWLEDGEMENTS


The writer wishes to acknowledge gratefully the assistance
of the members of his supervisory and candidacy committees in the
preparation of this dissertation. In particular, Dr. S. M. Hunka,
chairman of the committee, provided guidance without which the
writer could not have succeeded.

The constructive suggestions of Dr. F. J. Boersma,
Dr. D. P. McLeod, Dr. L. D. Nelson, and Dr. K. Smillie are most
appreciated.

Finally, the author is grateful to the University of Alberta
for providing the financial assistance which allowed him to pursue
graduate studies and complete this thesis.




TABLE  OF  CONTENTS 


CHAPTER                                                        PAGE

I     INTRODUCTION ..... 1

II    GENERAL PROBLEM ..... 5

III   MEASUREMENT IN EDUCATION AND PSYCHOLOGY ..... 13
        Test Theory ..... 15
        Axioms and Principal Results ..... 15
        Test Score ..... 17
        Test Score Distribution ..... 17
        Standard Error of Measurement ..... 21
        Transformation of Scores ..... 23
        Test Development ..... 25
        Qualitative Criteria ..... 25
        Quantitative Criteria ..... 30

IV    ASSESSMENT OF TESTS ..... 32
        Validity ..... 32
        Reliability ..... 37
        Interdependencies ..... 40
        Factors Affecting Validity and Reliability ..... 40
        The Criterion ..... 45
        Item Analysis ..... 46
        Factor Analysis ..... 49
        Relationship of Item to Test Score ..... 52
        Correction for Attenuation ..... 54

V     REVIEW OF ITEM SELECTION PROCEDURES ..... 58
        Weighting ..... 58
        Regression Procedures ..... 59
        Canonical Correlation ..... 61
        Factor Analysis ..... 62
        Scale Analysis ..... 62

VI    THEORETICAL DEVELOPMENT OF THE SELECTION TECHNIQUE ..... 64
        Factor Analysis and Prediction ..... 65
        Proposed Selection Technique ..... 65
        General Description of the Selection Procedure ..... 68
        Mathematical Description of the Selection Procedure ..... 74
        Criteria for Item-Selection ..... 80
        Validity and Reliability Estimates ..... 83
        Worked Example of the Selection Technique ..... 85

VII   EVALUATION OF THE ITEM SELECTION METHOD ..... 92
        Comparison to Other Models ..... 93
        Versatility of the Selection Algorithm ..... 95
        Item Pools ..... 96
        Limitations ..... 96
        Test Constructor Involvement ..... 99

VIII  SUMMARY, CONCLUSIONS AND IMPLICATIONS ..... 100
        The Problem and Proposed Solution ..... 100
        General Conclusions ..... 101
        Implications ..... 101
        Theoretical ..... 101
        Practical ..... 102
        Implications for Further Research ..... 102

REFERENCES ..... 104

APPENDIX A - Item Selection Algorithm ..... 112


LIST  OF  FIGURES 


FIGURE                                                         PAGE

1     Position of the Goal Test Vector in the Common Factor Space ..... 70

2     Relative Positions of Item Vectors in the Common Factor Space ..... 71

3     Orthogonal Clusters of Items ..... 98


CHAPTER  I 


INTRODUCTION 

In recent years there has been considerable interest in objectifying
measurement and evaluation procedures (Greene, Jorgenson and
Gerberich, 1954; Thorndike and Hagen, 1955; Helmstadter, 1964). Increasing
emphasis has been given to multiple-choice tests that can be machine
scored.

It is commonly accepted by test experts that multiple-choice
items are the most highly regarded and widely used form of objective test
item. "Almost any understanding or ability that can be tested by means
of any other item form . . . can also be tested by means of multiple-choice
test items" (Ebel, 1965, p. 149). Although there has been much
criticism  regarding  the  use  of  multiple-choice  tests  (Hoffman,  1962), 
the  critics  seldom  seriously  attempt  to  make  a  good  case  for  a  better 
way  of  measuring  educational  achievement. 

While  the  mechanics  of  handling  test  administration  and  test 
scoring  have  readily  been  adapted  to  our  "modern  era"  by  the  use  of 
optical scanners such as the DIGITEK, MRC, and IBM machines, few
applications  of  mathematical  algorithms  have  been  made  for  selecting 
items  to  construct  a  test.  A  review  of  the  literature  has  revealed 
several  methods  (Wherry  and  Gaylord,  1946;  Webster,  1956;  Elfving, 
Sitgreaves  and  Solomon,  1959;  Flowers,  1965)  for  selecting  items  but  few 
test  constructors  have  published  descriptions  of  practical  applications 
of  the  procedures. 

With the ordinary computing methods used by many researchers, the
labor of even approximate solutions proposed in some selection procedures
makes the techniques impractical. With the development of high speed
electronic  computers,  exact  solutions  to  the  problem  may  prove  to  be 
quite  practical. 

There is a prevalent need for an analytic item-selection procedure.
Items have been written for all age and grade levels. Perhaps
the most productive persons have been those teaching junior high school,
senior high school and university courses. Many pools of items are
already available. At the university level, published lists of items
(Hilgard, 1962; Orleans, 1963) are available and many instructors have
accumulated items throughout the years. Item analysis techniques have
been used extensively by test constructors for selecting items (Davis, 1951;
Nunnally, 1959; Adams and Torgerson, 1964). The procedure of using item
analysis data becomes tedious and time consuming when several item
characteristics are evaluated simultaneously for many items. Analytic
procedures are required that use a greater proportion of the statistical
information available. Use of approximate solutions should yield to exact
techniques.

The  procedures  for  selecting  items  to  construct  a  test  seem  to 
parallel  other  psychometric  developments.  In  factor  analytic  theory, 
the  rotation  of  axes  has  been  a  perennial  problem.  At  first  rotations 
were  done  by  hand  with  simple  mechanical  aids.  A  theory  for  rotation 
was presented by Thurstone (1947) and his followers. Later equations
were  derived  that  were  subsequently  used  as  a  criterion  for  rotation 
(Carroll, 1953; Neuhaus and Wrigley, 1954; Kaiser, 1958; Saunders, 1960).
As a result of the increased computational effort required for even a
small  problem,  the  use  of  computers  was  introduced  to  make  the  procedure 
feasible.  Some  analytic  procedures  are  being  used  in  test  construction 
but  in  these  procedures  considerable  information  is  not  used.  Tedious 
procedures  are  still  in  use.  Slowly  there  is  evolving  statistical 
methodology  that  is  finding  application  because  electronic  machines 
have  reduced  the  necessary  hand  computations  to  a  minimal  level.  The 
procedure  suggested  by  the  writer  centers  upon  an  analytic  method  for 
selecting  items  from  a  pool  of  items.  Although  the  proposed  method  for 
selecting  items  involves  a  considerable  amount  of  calculation,  this  does 
not  pose  a  problem.  "As  the  computer  revolution  continues  in  psychometrics, 
we  can  expect  objective  algorithmic  methods  to  become  the  rule  rather 
than  the  exception"  (Green,  1966,  p.  444).  The  need  for  an  automatic 
analytic item-selection technique is evident in our educational system
where  we  continually  construct  new  tests  and  modify  our  old  ones. 

The  proposed  selection  technique  is  to  some  extent  flexible  for 
individual  users  who  require  a  test  with  specific  characteristics. 

Within  limits,  a  desired  reliability  and  validity  estimate  of  the  final 
test  can  be  obtained  by  selecting  the  appropriate  items.  This  can  be  done 
automatically.  Factors,  from  a  factor  analysis  of  the  items  and  criteria, 
are  used  as  a  basis  to  select  the  items  to  construct  a  test.  Consideration 
has  been  given  to  developing  a  practical  and  objective  method  for 
selecting  the  "best  subset"  of  items  from  a  given  pool  of  items  even  if 
hypothetical  criteria  must  be  devised. 

The  proposed  selection  method  is  perhaps  better  designated  as 
the revision of a test with n items into a test with k items (k<n) since
the "item pool" idea generally does not require that all items have been
given  to  the  same  group  or  in  the  same  test.  Thus,  a  restriction  is 
placed  upon  the  definition  of  an  item  pool  as  used  in  this  study  in  that 
only  items  that  have  been  administered  to  the  same  group  of  subjects 
and  within  the  same  test  format  will  be  used  to  form  the  item  pool. 

The  solution  is  amenable  to  computation  by  an  electronic  computer. 
A number of factors complicate the item selection problem, not least of
which are test reliability, test validity, and the resulting test score
distribution. 


CHAPTER  II 


GENERAL  PROBLEM 

Many  objective  items  are  now  available  to  form  pools  of  items. 
However,  because  of  the  varying  nature  of  the  presently  available  items, 
it  is  not  sufficient  to  merely  collect  items  and  form  pools  of  items. 

Prior  to  the  inclusion  of  an  item  in  a  pool  each  item  should  be 
inspected  by  a  sophisticated  judge  to  determine  whether  obvious  flaws 
are apparent in the item construction (Ebel, 1965). Common characteristics
to be investigated are the precision with which the problem and
solution  are  stated  and  the  appropriateness  of  the  item  being  made  a 
part  of  an  item  universe.  The  next  step,  following  the  administration 
of  the  items  to  a  group  of  subjects,  is  a  statistical  analysis  of  the 
items.  An  item  analysis  will  provide  further  information  about  the 
quality  of  an  item.  Although  consideration  must  be  given  to  the 
statistical  and  authoritative  judgements  in  evaluating  items,  the 
collection  of  items  should  be  carefully  constructed  in  such  a  way  as  to 
sample  broadly  the  desired  content  and  educational  objectives  of  the  pool. 
The  content  and  educational  objectives  should  be  described  prior  to  the 
writing  and/or  selection  of  the  items  for  the  item  pool. 

At  present  the  main  procedure  for  selecting  items  to  construct 
a  test  is  item  analysis.  While  several  statistical  item  characteristics 
are  obtained  from  an  item  analysis,  the  technique  does  not  readily 
lend  itself  to  provide  an  answer  as  to  which  is  the  "best"  item,  "second 
best" item and so on. What is required is an analytic method for selecting
the best subset of items from the available pools of items.

While  item  analysis  procedures  can  be  used  to  evaluate  the 
statistical  acceptability  of  an  item,  the  item  parameters  obtained 
are  not  in  a  form  that  permits  a  test  constructor,  in  many  situations, 
to  decide  readily  which  item  is  the  "best"  item  of  a  pair  of  items. 

The problem becomes considerably more complicated when there are several
items  of  which  only  the  few  "best"  items  are  to  be  selected.  It  is 
assumed,  for  the  purpose  of  the  present  discussion,  that  the  relative 
importance  of  the  content  and  learning  objectives  in  the  area  under 
examination  have  been  considered  in  relation  to  each  item  that  was 
subjected  to  an  item  analysis.  In  addition,  the  most  general  meaning 
has  been  intended  in  the  use  of  the  term  item  analysis  since  "there  is  no 
one  type  of  item  analysis  data  that  is  best  under  all  circumstances" 
(Davis,  1951,  p.  297). 

A  procedure  commonly  used  in  test  construction  is  that  of  writing 
each  item  on  a  card  and  then  adding  the  relevant  item  analysis  datum  as 
it  is  collected.  In  this  way,  a  pool  of  items  is  obtained.  When  a 
new  test  is  to  be  constructed,  the  test  constructor  may  select  items 
from  the  pool  that  are  acceptable  to  him.  The  items  are  then  used  to 
construct  a  test.  Although  the  above  procedure  has  merit,  two  major 
problems  exist.  The  first  problem  is  the  difficulty  of  interpreting 
the  relevance  and  significance  of  the  item  analysis  parameters.  If 
the  items  have  been  administered  in  several  different  tests  to  many 
groups of subjects, the item parameters are not directly comparable.

A second problem is the dimensionality of the test. Without a statistical
analysis of the selected items, no knowledge is available
regarding the nature and the number of dimensions the test will have.

An  analytic  technique  is  required  that  will  select  the  "best" 
item,  "second  best"  item  and  so  on.  The  procedure  suggested  here,  which 
utilizes  factor  analytic  theory,  for  selecting  items  is  not  intended 
to  replace  existing  item  analysis  methods  but  rather  to  extend  the 
analytic  approach  of  the  test  constructor.  Item  analysis  data  must 
be  inspected  before  an  item  becomes  part  of  a  pool  of  items.  After  a 
pool  of  items  is  formed,  the  proposed  algorithm  may  be  used  to  select 
items according to the specific criterion established by the test constructor.

The summary statistical information available on a particular
item embedded in an adequately defined data matrix, coupled with the
power of an electronic computer, should result in a procedure for item
selection that is superior to item analysis procedures. It is difficult to
determine the relative importance of the item validity coefficient and
the discrimination index when comparing many items. The added information
on the dimensionality and location of an item vector in an item and
criterion space, provided by factor analysis, is lacking in item analysis
methods. Therefore more information is available through the factor
analytic technique than through standard item analysis.
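
The kind of information that the location of an item vector carries can be shown with a short Python sketch. It relies on two standard results for orthogonal common factors: the communality of an item is the sum of its squared loadings, and the correlation between two items reproduced by the common factors is the inner product of their rows of loadings. The function names and the loadings are hypothetical.

```python
def communality(row):
    """Squared length of an item vector: variance shared with the common factors."""
    return sum(a * a for a in row)


def reproduced_correlation(row_i, row_j):
    """Correlation between two items as reproduced by orthogonal common factors."""
    return sum(a * b for a, b in zip(row_i, row_j))


# Hypothetical loadings of three items on two orthogonal factors.
item_a = [0.8, 0.1]
item_b = [0.6, 0.3]
item_c = [0.1, 0.7]

print(round(communality(item_a), 2))                     # -> 0.65
print(round(reproduced_correlation(item_a, item_b), 2))  # -> 0.51
print(round(reproduced_correlation(item_a, item_c), 2))  # -> 0.15
```

Items a and b lie near the same factor and are therefore substantially correlated, while item c lies near the second factor. A single validity coefficient or discrimination index for each item would not reveal this difference in location.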

A primary consideration is meaningfulness and practicality. Any
method should require minimal effort on tedious nonprofitable tasks.

This can be done by using electronic equipment and suitable numerical
procedures. By providing a general solution, variations desired by the
test user may be developed by specifying required parameters.


In  the  proposed  method,  the  onus  is  on  the  test  constructor  to 
provide  estimates  of  certain  desirable  test  characteristics.  A  test  is 
constructed  by  selecting  items  to  form  a  test  which  is  dependent  upon  the 
established  tolerance  limits  set  by  the  user  and  the  nature  of  the  item 
pool  from  which  the  items  are  to  be  selected.  The  constructed  test  is 
to  be  an  acceptable  approximation  of  a  postulated  hypothetical  test. 

Each  user  of  the  proposed  technique  should  be  able  to  construct  his  own 
pool  of  items  which  satisfy  certain  conditions  deemed  necessary.  Such 
flexibility  should  be  allowed  within  an  analytic  system.  However,  while 
the  selection  of  items  can  conceivably  be  carried  out  by  the  proposed 
algorithm  on  an  electronic  computer  and  thus  result  in  a  test  being 
constructed  that  has  been  "untouched  by  human  hands",  it  is  undesirable 
to  have  a  test  constructed  that  has  been  "untouched  by  human  minds". 
Decisions  regarding  criteria,  type  of  item,  length  of  a  test,  and  final 
test  characteristics  remain  the  jurisdiction  of  an  "informed"  test 
constructor. 

Lumsden  (1961)  prepared  a  general  survey  of  the  construction  of 
unidimensional  tests.  Greatest  emphasis  was  "deliberately  placed  on 
item selection rationale since this topic appears to have been relatively
neglected in the literature of the problem" (Lumsden, 1961, p. 130).
The general conclusion advanced was that only factor analysis provides
a rational procedure for item selection in the construction of unidimensional
tests.

The  theoretical  development  of  measurement  theory  has  primarily 
been  concerned  with  unidimensional  tests.  It  is  the  writer's  contention 
that most achievement tests and ability tests are multidimensional. If
this is the case, we should be concerned with selecting items based upon
an  awareness  of  the  multidimensional  nature  of  the  predictor  variables 
and  the  criteria.  The  selection  of  items,  to  construct  a  test  from  a 
pool  of  items,  is  not  generally  done  by  random  selection  procedures. 
Consideration  of  difficulty  and  a  discrimination  index  for  each  item 
enters  into  the  decision  as  to  the  selection  or  the  rejection  of  an 
item.  In  addition,  each  dimension  of  a  multidimensional  test  should 
receive  recognition  in  the  item  selection  process.  A  special  case  of 
selecting  items  from  a  multidimensional  space  occurs  when  the  item  and 
criterion  space  is  defined  as  unidimensional. 

As  in  most  methods  of  test  construction,  the  "criterion  problem" 
must  be  considered  in  the  proposed  item  selection  procedure.  In  an 
effort  to  structure  the  criterion  problem  and  some  possible  solutions, 
Astin  (1964)  presented  a  paper  concerned  with  clarifying  certain  issues 
regarding criterion measures and their use in educational and psychological
research.

Although Astin deals with the problem of multiple criterion
elements, the multidimensionality of criteria and the uniqueness of
criterion measures found in some studies (Ryans, 1966; Kelly, 1966)
suggest that greater effort should be made to produce an acceptable
procedure for dealing with multiple criteria. Gulliksen (1950 b) has
suggested that the most information about the criterion is available
when a comprehensive matrix of intercorrelations including both predictor
and criterion variables is utilized. Although Astin is critical
of Gulliksen's recommendation in that it "involves circular reasoning
or, at best, misuse of terms" (Astin, 1964, p. 812), Rozeboom (1966)
appears to be in agreement with Gulliksen. Rozeboom suggests that

the  concept  of  "validity"  (can  be)  generalized  from  predicting 
a  single  criterion  to  predicting  within  a  space,  S  ,  of 
criterion  variables  .  .  .  which  makes  clear  that  the  concept 
of  test  penetrance  into  criterion  space  is  a  natural 
generalization of single-criterion validity theory (1966,
pp. 442-443).

The  writer  is  in  agreement  with  Rozeboom  and  has  incorporated  the 
suggested  "space  concept"  of  criterion  variables  into  the  item-selec¬ 
tion  method. 

Thorndike  (1949),  Gulliksen  (1950  a)  and  Astin  (1964)  are  in 
agreement  that,  in  the  final  analysis,  a  judge  or  panel  of  "experts" 
must  decide  on  rational  grounds  how  relevant  each  element  is  to  the 
conceptual  criterion.  The  relationship  between  qualitative  and  quanti¬ 
tative  decisions  is  especially  relevant  to  the  "criterion  problem". 
Gulliksen  aptly  summarizes  the  situation  in  saying  that 

mathematical  procedures  are  appropriately  used  when  they  serve 
to  guide  thought.  If  an  attempt  is  made  to  utilize  such 
routines  as  a  substitute  for  thought,  we  may  unwittingly  arrive 
at  and  accept  absurd  conclusions  (1950  a,  p.  351). 

Since in the method proposed an hypothetical test must be
specified, in the form of weights assigned to each criterion factor, each
constructed test should be more acceptable and meaningful to the test
user than existing tests prepared by other methods. As items are added
to the item pool, the factor analysis of the included items will
reveal any change in the nature of the pool. Thus, some control can be
maintained over the inclusion of items similar to and/or different from
those in the existing pool. It is thus at the discretion of the user
whether the item pool should remain the same, in a factor space sense,
or be changed in a specified manner. Therefore, with knowledge of the




factor pattern of the item and criterion space, as well as the associated
dimensions, more information is being used in selecting items. This
should result in improved test construction practices.

Test specialists may react immediately to the assumption of
homogeneity of items or unidimensionality. One
solution to the problem of working with multidimensional tests is to
consider each test as if it were a sub-test of a test battery. However,
test constructors should be made aware of the fact that many tests are
multidimensional while being considered as unidimensional. The number
of dimensions that an item and criterion space occupy is defined as
being the number of orthogonal vectors required to span the space as
defined by the item and criterion intercorrelation matrix.

Initially the proposed procedure uses the intercorrelation
matrix of items and criteria as basic data. The criteria are not necessary
but are desirable because they introduce an item-criterion space
when the intercorrelation matrix is factor analyzed. After the m
factors of interest have been extracted, the test constructor is required
to assign a relative weight to each factor. A factor is a construct,
a hypothetical entity that is assumed to underlie tests and test
performance. The interpretation and naming of factors call for
psychological insights, before and after the factor analysis is made, in
addition to statistical understanding. Since the test constructor is
familiar with the test items, he should be able to assign meaningful
names to each factor. The weighting is an indication of the relative
importance of each factor to the test constructor's conceptual criterion
which is an hypothetical test vector in the item and criterion space.




A rotation of the factor matrix results in factor one being collinear
with the hypothetical test formed by weighting each factor. In this form,
the loading of each item on factor one is the correlation of the item
with the hypothetical test. With consideration given to the size of
the loading on factor one, the communality of the item and the angular
displacement of the item from the hypothetical test, it is now possible
to select items to construct the desired final test that will have
properties similar to the hypothetical test.
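
The rotation step just described can be illustrated numerically. The following is only a minimal sketch under stated assumptions, not the program used in this study: the loading matrix, the factor weights and the function name `goal_vector_summary` are hypothetical and invented for illustration. It relies on the fact that, for any orthogonal rotation whose first axis lies along the unit-length goal vector, the loading of an item on the new factor one is the inner product of its original loadings with that vector, while its communality is unchanged.

```python
import math

def goal_vector_summary(loadings, weights):
    """For each item return (loading on rotated factor one, communality,
    angle in degrees between the item vector and the goal vector).
    Assumes orthogonal factors and a non-zero communality for every item."""
    norm = math.sqrt(sum(w * w for w in weights))
    goal = [w / norm for w in weights]                    # unit-length goal vector
    summary = []
    for row in loadings:
        first = sum(a * g for a, g in zip(row, goal))     # loading on new factor one
        communality = sum(a * a for a in row)             # invariant under rotation
        cosine = first / math.sqrt(communality)
        cosine = max(-1.0, min(1.0, cosine))              # guard against rounding error
        summary.append((first, communality, math.degrees(math.acos(cosine))))
    return summary

# Hypothetical loadings of three items on two orthogonal factors.
loadings = [[0.6, 0.0], [0.0, 0.5], [0.3, 0.4]]
# Hypothetical relative weights assigned to the factors by the test constructor.
weights = [3.0, 4.0]

for first, h2, angle in goal_vector_summary(loadings, weights):
    print(round(first, 3), round(h2, 3), round(angle, 1))
```

In this illustration the third item lies along the goal vector (angular displacement of zero), yet its loading on factor one is limited by its communality, which is why the text asks that all three quantities be considered together.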

Some provision must be made for up-dating the pool of items.
Although items in raw score form for the same subjects can readily be
added, the procedure for introducing additional items into a factor
space is more complicated. Fortunately, Wherry and Winer (1953),
Fruchter (1954) and Fruchter and Jennings (1962) provide partial
solutions for this problem. The use of correlation matrices with missing
data may be another approach to up-dating a pool of items.




CHAPTER  III 


MEASUREMENT  IN  EDUCATION  AND  PSYCHOLOGY 

In  any  scientific  approach,  the  first  concern  is  with  stating 
the  problem  in  clearly  defined  terms.  The  problem  should  then  be 
systematically  approached  within  a  framework  of  theory.  A  theory  is 
defined  as  a  deductively  connected  set  or  system  of  related  conceptions 
in  agreement  with  known  properties.  The  body  of  knowledge  thus 
acquired  provides  generalizations  and  laws  that  can  be  applied  to  the 
solution  of  a  range  of  problems. 

The general presentation in Chapter III provides a brief outline
of  test  theory  that  can  be  used  with  qualitative  and  quantitative 
criteria  to  develop  tests.  After  a  test  has  been  constructed,  the  next 
logical  step  is  to  assess  how  well  the  objectives  have  been  met  that 
were  used  to  prepare  the  test.  Chapter  IV  contains  a  discussion  of  item 
and  test  score  characteristics  relevant  to  the  assessment  of  tests. 
Validity,  reliability  and  the  related  interdependencies  to  item  and  test 
score  parameters  are  discussed. 

A review of measurement theory and related literature is presented
primarily as background material. Although the problem being investigated
is related to all aspects of measurement theory, the presentation of
relevant background material does not lead directly to the problem and
proposed solution. However, the material presented in Chapters III and
IV, especially test theory, reliability and validity, is directly
relevant in evaluating item selection procedures, and therefore complete




tests.  Without  reference  to  the  theoretical  foundation  of  measurement 
and  the  concepts  of  reliability  and  validity,  one  would  find  it 
difficult  to  evaluate  the  item  selection  procedures  reviewed  in  Chapter  V. 
Similarly,  the  theoretical  development  of  the  proposed  selection 
technique  and  the  evaluation  of  it  follow  directly  from  measurement 
theory. 

A second purpose for reviewing measurement theory rests with
the need to ensure that users of analytic procedures for test
construction are fully aware of the need to assess tests from a theoretical
basis regardless of the manner by which the test was constructed.

Analytic  procedures  must  not  misdirect  users  into  believing  that 
measurement  theory  is  absolute  or  that  they  are  not  obligated  to  apply 
criteria  additional  to  those  applied  analytically  to  the  evaluation 
of  the  final  test. 

There  is  a  need  to  continually  reassess  the  quality  of  a  test  in 
terms  of  reliability,  validity,  and  test  score  distribution  whether 
the  test  items  are  selected  by  the  computer  or  by  a  human.  Such  assess¬ 
ment  cannot  be  made  without  a  basic  understanding  of  measurement  theory. 

"Measurement  means  the  description  of  data  in  terms  of  numbers 
and  this,  in  turn,  means  taking  advantage  of  the  many  benefits  that 
operations with numbers and mathematical thinking provide" (Guilford,
1954, p. 1). A product of measurement is a meaningful quantitative
description  given  in  terms  that  directly  convey  some  notion  of  the 
frequency,  amount  or  degree  to  which  the  individual  manifests  some 
property.  Thus,  "the  scores  are  expressed  in  such  a  way  that  certain 




characteristics  or  qualities  of  the  individual  are  immediately  manifest 
in a quantitative sense" (Ghiselli, 1964, p. 44).

In  addition  to  quantitative  description,  or  measurement,  use  is 
made  of  qualitative  description,  commonly  referred  to  as  classification. 
"All  variables  can  be  classified  into  one  or  the  other  of  two  general 
types,  those  which  are  qualitative  variables  and  those  which  are  quanti¬ 
tative  variables"  (Ghiselli,  1964,  pp.  11  -  12).  Qualitative  variables 
are  nominal  variables  whereas  quantitative  variables  can  be  subdivided 
into  ordinal  variables,  interval  variables  and  ratio  variables. 

Measurement  at  best  only  provides  information  by  the  process  of 
assigning  numbers  to  individual  members  of  a  set  for  the  purpose  of 
indicating  differences  among  them  in  the  degree  to  which  they  possess 
the  characteristic  being  measured.  Evaluation  is  a  judgement  of  merit 
that  is  sometimes  based  solely  on  measurements  but  more  frequently 
involves  the  synthesis  of  various  measurements  and  subjective  impressions. 
Evaluation,  the  more  recent  term,  includes  the  concept  of  measurement 
as  used  in  education  and  psychology.  However,  measurement  does  not 
necessarily  imply  evaluation.  "Evaluation  assumes  a  purpose,  or  an 
idea  of  what  is  'good'  or  'desirable'  from  the  standpoint  of  the  in¬ 
dividual  or  society  or  both"  (Remmers  and  Gage,  1955,  p.  21). 

Test  Theory 

Axioms  and  Principal  Results.  Directly  related  to  measurement 
is  a  basic  model  of  test  theory.  One  fundamental  notion  is  that  any 
observed measurement is contaminated by an error of measurement.
Thorndike (1951, p. 568) and Cronbach (1960, p. 128) have attempted to




classify  these  errors  exhaustively.  A  word  of  caution  is  required  here. 
The  errors  referred  to  are  not  errors  due  to  drawing  a  sample  from  a 
large  population  of  individuals.  Such  sampling  errors  are  essentially 
independent  of  errors  of  measurement. 

An  extensive  review  and  extension  of  classical  test  theory  has 
been presented by Novick (1966). Novick attempts to show that classical
test  theory  may  be  placed  on  a  firm  theoretical  foundation  and  that  its 
necessary  assumptions  are  very  weak  and  hence  generally  satisfied. 

The simplest basic model is the classical linear model in which
an observed score $X_i$ can be divided into two additive components, a
"true score" $T_i$ and an "error score" $E_i$, that is

$$X_i = T_i + E_i$$

It is assumed, for $E_i$, that we are dealing with random errors, normally
distributed, where (a) the mean $\mathcal{E}(E) = 0$, (b) covariance
$\mathcal{E}(T_i, E_i) = 0$, (c) covariance $\mathcal{E}(E_g, E_h) = 0$.
$E_g$ and $E_h$ are random errors on two testing occasions (Gulliksen,
1950 a). $\mathcal{E}$ denotes expected values. The variance
of the gross observed scores is then given by

$$s_x^2 = s_t^2 + s_e^2$$

Gulliksen (1950 a) has shown that the index of reliability for
a test is the proportion of true score variance divided by the observed
score variance, that is

$$r_{x_g x_h} = \frac{s_t^2}{s_x^2}$$



where $x_g$ and $x_h$ are two parallel form measures. It may be shown that

$$s_x^2 = s_x^2 \, r_{x_g x_h} + s_e^2$$

and that

$$s_e = s_x \sqrt{1 - r_{x_g x_h}}$$

where $s_e$ is the standard error of measurement. This is a fundamental
concept in test theory and defines an important characteristic of a test.
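
The relations of the classical model can be checked on a small numerical case. The sketch below is illustrative only: the true scores and errors for four examinees are hypothetical values, chosen so that the assumptions of zero mean error and zero covariances hold exactly in the data, and population formulas (division by N) are assumed throughout.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(a, b):
    """Population covariance (divides by N); covariance(a, a) is the variance."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

# Hypothetical true scores and errors on two parallel forms g and h.
# The errors have zero mean and are uncorrelated with the true scores
# and with each other, as the classical model assumes.
true_scores = [10, 10, 20, 20]
errors_g = [1, -1, 1, -1]
errors_h = [1, -1, -1, 1]

x_g = [t + e for t, e in zip(true_scores, errors_g)]   # X = T + E on form g
x_h = [t + e for t, e in zip(true_scores, errors_h)]   # X = T + E on form h

var_t = covariance(true_scores, true_scores)
var_e = covariance(errors_g, errors_g)
var_x = covariance(x_g, x_g)

reliability = var_t / var_x                             # s_t^2 / s_x^2
parallel_r = covariance(x_g, x_h) / math.sqrt(var_x * covariance(x_h, x_h))
s_e = math.sqrt(var_x) * math.sqrt(1 - reliability)     # s_x * sqrt(1 - r)

print(var_x, var_t + var_e)      # observed variance vs. true plus error variance
print(reliability, parallel_r)   # variance ratio vs. parallel-forms correlation
print(s_e, math.sqrt(var_e))     # standard error of measurement vs. error s.d.
```

Each printed pair agrees in this constructed case; with real data the assumptions hold only approximately, so the pairs would agree only approximately.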

However, validity is the most important criterion by which a
test may be judged (Helmstadter, 1964). Validity can be regarded as
being composed of essentially two components: the accuracy of
measurement or reliability and what the test is intended to measure or the
criterion for the relevance of the test (Cureton, 1950; Remmers and Gage, 1955).

Test Score. "A score is a number assigned to an examinee to
provide  a  quantitative  description  of  his  performance  on  a  particular 
test"  (Ebel,  1965,  p.  462).  When  a  test  contains  many  items,  the  raw 
score  of  an  individual  is  commonly  defined  as  the  number  of  items  that  are 
answered  correctly.  A  correction  for  guessing  or  a  differential 
weighting  system  for  the  items  may  be  applied  to  improve  or  refine  the 
raw  score.  There  is,  however,  not  complete  agreement  among  test  specialists 
concerning  the  questions  of  using  a  weighting  technique  or  a  correction 
for  guessing  (Traxler,  1951). 

Test  Score  Distribution.  Distributions  of  scores  vary  markedly 
in  their  shape,  manifesting  different  degrees  and  combinations  of  skew- 




ness  and  kurtosis.  Early  investigators  seemed  to  think  that  there  was 
a  natural  law  for  human  abilities  to  be  normally  distributed.  Now,  it 
is  realized  that  such  a  statement  is  meaningless  since  the  shape  of  a 
distribution  depends  on  the  scale  of  measurement. 

Moment  statistics  can  be  used  to  summarize  and  characterize 
data.  The  most  important  set  of  moments  in  statistical  theory  is  ob¬ 
tained  by  calculating  moments  about  the  arithmetic  mean.  Two  of  them, 
the  arithmetic  mean  and  the  variance  are  in  common  use.  The  first  four 
moments,  commonly  called  deviations  from  the  mean  or  simply  deviations, 
are  defined  as  follows: 


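$$\mu_1 = \frac{\sum x}{N} = 0, \qquad \mu_2 = \frac{\sum x^2}{N} = s^2, \qquad \mu_3 = \frac{\sum x^3}{N}, \qquad \mu_4 = \frac{\sum x^4}{N}$$

where $x$ represents the deviation of each score from the mean of all the
scores.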

A measure of skewness defined in terms of moments is

$$g_1 = \frac{\mu_3}{\mu_2^{3/2}}$$

The value of $g_1$ will be zero for symmetrical distributions. Skewness,
measured as a departure of $g_1$ from zero, is positive when $g_1$ is positive
and negative when $g_1$ is negative.



The degree of kurtosis can be described by

$$g_2 = \frac{\mu_4}{\mu_2^2} - 3$$

A distribution of scores is leptokurtic when $g_2$ is positive, platykurtic
when $g_2$ is negative, and normal when $g_2 = 0$.
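
The moment-based indices can be computed directly from a set of scores. The following is a minimal sketch assuming population moments (division by N), as in the definitions above; the function names and the two small score lists are hypothetical, chosen only for illustration.

```python
def central_moment(scores, k):
    """k-th moment about the arithmetic mean, dividing by N."""
    m = sum(scores) / len(scores)
    return sum((x - m) ** k for x in scores) / len(scores)

def skewness(scores):
    """g1: third moment divided by the second moment raised to the power 3/2."""
    return central_moment(scores, 3) / central_moment(scores, 2) ** 1.5

def kurtosis_excess(scores):
    """g2: fourth moment divided by the squared second moment, minus 3."""
    return central_moment(scores, 4) / central_moment(scores, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]     # hypothetical symmetrical, flat set of scores
right_tailed = [0, 0, 0, 4]     # hypothetical set with a long upper tail

print(skewness(symmetric), kurtosis_excess(symmetric))
print(skewness(right_tailed), kurtosis_excess(right_tailed))
```

The symmetrical set gives a skewness of zero and a negative value of the kurtosis index (platykurtic), while the set with the long upper tail gives a positive skewness.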

A normal frequency distribution can be completely described by
the mean and the variance when both $g_1$ and $g_2$ are zero. While it is
convenient to use the normal curve, one must remember that "very few of the
instruments used in psychological 'measurement' involve equal unit
scales - the measuring units are frequently arbitrary or even accidental"
(McNemar, 1962, p. 28). It would seem that skewness and kurtosis are
partly a function of the accidental nature of the measuring units. The
values of $g_1$ and $g_2$ are, however, useful for descriptive purposes.

The  higher  moments 

.  .  .  have  relatively  little  use  in  elementary  applications  of 
statistics,  but  they  are  important  for  mathematical  statisticians 
in  the  study  of  the  properties  of  distributions  and  in  arriving 
at theoretical distributions fitting observed data (Hays, 1963,
p. 186).

The  means,  standard  deviations  and  intercorrelations  of  items  in 
a  test  have  a  very  important  bearing  upon  the  shape  of  the  total-score 
distribution.  If  the  items  are  relatively  easy,  a  negatively  skewed 
distribution  will  result,  whereas,  if  the  average  item  mean  (item 
difficulty)  becomes  lower  the  score  distribution  becomes  positively  skewed. 
With  items  of  medium  difficulty,  the  distribution  becomes  symmetrical. 
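
The direction of skew associated with item difficulty can be illustrated under a strong simplifying assumption that the text itself does not make: all items are equally difficult and statistically independent, so that the number-right score follows a binomial distribution. The function name and the choice of 20 items are hypothetical.

```python
import math

def total_score_skewness(n_items, p_correct):
    """Skewness g1 of the number-right score for n_items independent items,
    each answered correctly with probability p_correct (an assumption)."""
    q = 1.0 - p_correct
    probs = [math.comb(n_items, k) * p_correct ** k * q ** (n_items - k)
             for k in range(n_items + 1)]
    mean = sum(k * pr for k, pr in enumerate(probs))
    mu2 = sum((k - mean) ** 2 * pr for k, pr in enumerate(probs))
    mu3 = sum((k - mean) ** 3 * pr for k, pr in enumerate(probs))
    return mu3 / mu2 ** 1.5

# Easy items (high proportion correct), medium items, and hard items.
for p in (0.8, 0.5, 0.2):
    print(p, round(total_score_skewness(20, p), 3))
```

Under these assumptions easy items give a negative skewness, hard items a positive skewness, and items of medium difficulty a symmetrical distribution. Real items are intercorrelated, so the sketch shows only the direction of the effect, not its size.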

The chief effect of item intercorrelations is upon kurtosis.
As item intercorrelations increase, the distribution of total-scores




grows flatter from mesokurtic to platykurtic, to rectangular, to
bimodal and finally U-shaped (Guilford, 1954). When item
intercorrelations increase, the test reliability subsequently increases which
usually influences the validity coefficient.

Thus,  the  distributions  of  actual  test  scores  depend  upon  the 
way  the  test  is  constructed.  Although 

relatively  little  of  a  precise  nature  is  now  known  regarding 
the  effect  of  item  selection  on  test  skewness,  kurtosis,  or  on 
the  constancy  of  the  error  of  measurement  throughout  the  test 
score  range,  ...  it  is  possible,  however,  to  select  items 
in  such  a  way  as  to  influence  the  test  mean,  variance, 
reliability  and  validity  (Gulliksen,  1950  a,  p.  365). 

This in turn will directly influence the test score distribution.

Mollenkopf (1949, 1950) has shown that the variation of the error of
measurement with test score depends on the third and fourth moments.
This offers some difficulties in the theoretical analysis of item
selection procedures. As a partial aid to the solution of the above
problem, Ray, Hundleby and Goldstein (1962) demonstrated that indices of
skewness and kurtosis for a test score distribution can be expressed in
terms of item parameters.

Although  attempts  have  been  made  to  select  items  on  the  basis  of 
the  first  four  moments,  the  selection  of  items  to  form  a  test  with 
given  skewness  and  kurtosis  has  not  been  solved.  Ray,  Hundleby  and 
Goldstein  (1962)  claim  that  any  moment  employed  in  describing  the 
frequency  distribution  of  raw  scores  can  be  expressed  as  a  function  of 
item  parameters  but  they  do  not  show  how  this  information  can  be  used  in 
the  practical  case  of  selecting  items  from  a  predefined  pool  to  construct 
a  test.  Since  the  correlation  between  gross  scores  is  identical  with 



the  correlation  between  linear  transformations  of  gross  scores,  the 
equations  dealing  with  the  effect  of  the  test  length  and  group  hetero¬ 
geneity  on  reliability  and  validity  hold  for  gross  scores  and  for  any 
linear  transformation  of  gross  scores. 

The  shape  of  the  score  distribution  may  be  altered  by  using 
various  transformations.  One  of  the  most  frequently  used  is  a 
logarithmic transformation of a psychological variable to obtain
scores  that  are  at  least  approximately  normal  (McNemar,  1962).  Use  of 
the  normal  curve  is  merely  a  convenience  and  is  not  necessarily  based  on 
any  "normal  distribution  of  behaviour"  in  nature.  Since  the  normal 
frequency  distribution  has  commonly  been  found  to  be  characteristic,  or 
nearly  so,  of  the  distributions  of  scores  on  a  wide  variety  of  character¬ 
istics,  it  has  been  established  as  one  particular  distribution  to  be 
used  as  a  frame  of  reference  for  comparison  purposes.  The  normal 
frequency  distribution  has  also  been  termed  the  curve  of  error  (Ghiselli, 
1964,  p.  59)  since  it  is  closely  approximated  in  situations  where  a  score 
is  determined  by  a  large  number  of  factors  which  operate  under  conditions 
of  equal  likelihood. 

Ghiselli  draws  the  conclusion  that 

.  .  .  there  are,  of  course,  a  wide  variety  of  differently 
shaped  distributions  that  could  be  adopted  as  the  theoreti¬ 
cal  model  of  the  distribution  of  psychological  traits.  Of 
all  the  possible  distributions  there  appears  to  be  more  basis 
for choosing the normal frequency distribution (Ghiselli,
1964, p. 62).

Standard  Error  of  Measurement.  The  standard  error  of  measure¬ 
ment  is  an  estimate  of  the  standard  deviation  of  the  errors  of  measure¬ 
ment  associated  with  the  test  scores  in  a  given  set.  In  terms  of  the 




reliability coefficient, $r_{xx}$, and the standard deviation, $s_x$, the
standard error of measurement formula presented by McNemar (1962) is

$$s_e = s_x \sqrt{1 - r_{xx}}$$

Thus, $s_e$ is useful in establishing 'true score' limits.
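
One way in which the standard error of measurement may be used to set such limits is sketched below. This is an illustration under assumptions not stated in the text: errors are taken to be normally distributed with the same standard deviation at every score level, the band is centred on the observed score, and the multiplier `z` is a hypothetical parameter chosen by the user.

```python
import math

def standard_error_of_measurement(s_x, r_xx):
    """s_e = s_x * sqrt(1 - r_xx), from the formula above."""
    return s_x * math.sqrt(1.0 - r_xx)

def score_band(observed, s_x, r_xx, z=1.96):
    """Limits of observed score plus or minus z standard errors of measurement.
    The default z = 1.96 is an assumed multiplier, not a value from the text."""
    half_width = z * standard_error_of_measurement(s_x, r_xx)
    return observed - half_width, observed + half_width

# Hypothetical test: standard deviation 10, reliability coefficient 0.75.
print(standard_error_of_measurement(10.0, 0.75))
print(score_band(100.0, 10.0, 0.75, z=2.0))
```

Because the band uses a single value of the standard error at every score level, it is subject to the reservation that the error of measurement need not be constant over the score range.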

Since  the  reliability  coefficient  is  dependent  on  the  variability 
of  the  group  to  which  the  test  is  applied,  whereas  the  standard  error 
of  measurement  is  affected  very  little  by  this  characteristic,  the 
latter  is  sometimes  proposed  as  a  measure  of  reliability  (Ebel,  1965). 
However, use of the standard error of measurement often assumes that
the error in estimating the true score is the same in all parts of the
range of the observed score. This is by no means necessarily true.
Also, for tests using a given type of item, the standard error of
measurement is almost entirely dependent upon the number of items
in the test and minimally upon their quality (Lord, 1957; Lord, 1959;
Swineford, 1959).

With  zero  skewness  and  kurtosis  of  3,  the  error  of  measurement 
is  constant  with  respect  to  size  of  test  score  (Gulliksen,  1950  a). 
Mollenkopf  (1949,  1950)  has  provided  empirical  evidence  to  show  that 
the  error  of  measurement  is  affected  by  the  effects  of  skewness  and 
variations  in  kurtosis.  He  concluded  that  slight  skewing  could  be 
tolerated  but  not  departures  in  kurtosis  from  3  as  the  error  of  measure¬ 
ment  will  then  vary  with  the  magnitude  of  the  test  score.  Lord  (1952) 
has  suggested  that  the  dispersion  of  errors  will  be  smallest  at  the  tails 
of a distribution and that the standard error of measurement should be
considered as an average error.




Transformations  of  Scores.  Since  many  raw  score  measurements 
do  not  have  the  characteristics  of  a  desirable  system  of  units,  raw 
scores  are  often  changed  by  means  of  a  transformation  to  "transmuted 
scores".  This  may  permit  easier  interpretation  of  the  score  and  allow 
comparisons  to  be  made  between  different  tests  or  between  different 
parts  of  the  same  test.  A  distribution  of  raw  scores  is  frequently 
converted  to  a  set  of  norms  since  "a  raw  score  on  any  psychological 
test  is,  in  itself,  quite  meaningless"  (Anastasi,  1961,  p.  76).  There 
are  various  ways  in  which  raw  scores  may  be  converted.  DuBois  (1965) 
defines two general classes of norms: reference norms and statistical
norms. 

Reference  norms  are  those  which  have  raw  scores  translated  into 
meaningful  work  standards  closely  related  to  psychological  tests.  These 
include  work  norms,  grade  norms,  mental  age  norms  (MA)  and  chrono¬ 
logical  age  norms  (CA) .  Work  norms  are  expressed  in  units  of  production 
in  a  standard  time  interval  by  a  member  of  a  specified  group.  In  age 
norms,  the  mean  performance  for  each  age  is  calculated  and  subsequently 
used  to  construct  a  distribution  of  scores  from  which  to  estimate  an 
age equivalent. Quotient norms, such as the intelligence quotient, have
been common in mental testing. The trend in mental testing
now seems to be towards the use of statistical norms, rather than
reference norms. Wechsler (1958) and Terman and Merrill (1960) have
provided statistical norms for the Stanford-Binet Intelligence Test,
the Wechsler Intelligence Scale for Children and the Wechsler Adult
Intelligence Scale.




When mathematical transformations are applied to raw scores in
calculating statistical norms, the norms are useful for comparison
purposes but have in and of themselves no direct meaning. The three
main types of statistical norms are percentiles, standard scores and
normalized scores which differ primarily in the shape of their
distributions. The distribution of percentiles is theoretically
rectangular where 1 percent of the sample size is included between two
adjacent percentiles. The shape of a distribution of standard scores
is identical to the distribution of raw scores. In general, if we wish
to transform a set of scores, $X$, having a mean, $M_x$, and a standard
deviation, $s_x$, to new values, $Y$, with mean equal to any value, $M_y$, and
a standard deviation, $s_y$, we can apply the formula

$$Y = \frac{s_y}{s_x} X - \frac{s_y}{s_x} M_x + M_y$$

Three common sets of standard scores are (a) standard z scores (0, 1),
(b) T scores (50, 10) and (c) stanines (5, 1.96). Normalized scores
are similar to standard scores with respect to characteristics of the
mean and standard deviation. In addition, a correction for departures
from normality is made on the original raw scores. The distribution of
the normalized scores approximates the normal distribution, with
decreasing "goodness of fit" as the shape of the original distribution
departs from normality.
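
The linear conversion of raw scores to standard scores described above may be illustrated with a short Python sketch. It is offered only as an illustration under stated assumptions: the population standard deviation is used for the raw scores (the text does not say which estimator is intended), and the function name `rescale` and the raw scores are hypothetical.

```python
from statistics import mean, pstdev


def rescale(scores, new_mean, new_sd):
    """Linearly transform scores to a chosen mean and standard deviation.

    Applies Y = (s_y / s_x) * X - (s_y / s_x) * M_x + M_y to every score.
    """
    m_x = mean(scores)
    # Assumption: population SD; the source does not specify the estimator.
    s_x = pstdev(scores)
    if s_x == 0:
        raise ValueError("scores have zero variance and cannot be rescaled")
    ratio = new_sd / s_x
    return [ratio * x - ratio * m_x + new_mean for x in scores]


# Hypothetical raw scores, used for illustration only.
raw = [12, 15, 18, 21, 24]

z_scores = rescale(raw, 0, 1)    # standard z scores (0, 1)
t_scores = rescale(raw, 50, 10)  # T scores (50, 10)
```

Because the transformation is linear, the shape of the distribution is unchanged; only the mean and standard deviation are altered. Normalized scores would require an additional, non-linear step that is not shown here.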


Various types of samples, such as male or female college students,
are used as a basis for establishing norms. Adequate norms for a specially
selected group may be calculated by using a large number of cases and a
representative  sample. 

Test  Development 

"A  test  is  a  general  term  used  to  designate  any  kind  of  device 
or  procedure  for  measuring  ability,  achievement,  interest  and  other 
traits"  (Ebel,  1965,  p.  466).  The  construction  of  any  test  involves 
numerous  decisions.  In  the  preparation  of  a  test,  one  of  the  most 
important yet most often neglected aspects has been a careful delimitation
and breakdown of the area or trait involved (Helmstadter, 1964).

A  test  should  be  based  on  a  representative  sampling  of  the  content 
studied  while  having  a  representative  sampling  of  the  abilities  or 
skills emphasized in the course (Adams and Torgerson, 1964, p. 322).
As no single instrument can measure all skills over an entire content
area, resort must be made to the procedure of using a representative
sample of test items. Ultimately, the test constructor, in applying his
experience and judgemental skill, decides exactly what will or will not
be  included  in  the  measure.  What  constitutes  important  materials  can 
only  be  determined  by  careful  attention  to  the  goals  of  a  course.  Part 
of  this  decision  should  be  determined  by  reference  to  future  courses  or 
types  of  employment  that  the  examinees  will  enter. 

Qualitative  Criteria.  The  plan  for  a  test  should  consider  the 
relative  emphasis  to  be  given  both  to  content  areas  and  to  the  processes 
or  cognitive  abilities  which  are  specific  ways  of  responding  to  or 
dealing  with  course  content.  A  detailed  analysis  of  educational  objectives 
for student achievement has been edited by Bloom (1956). Bloom and
associates have developed a taxonomy of educational objectives under which
educational  goals  and  test  items  in  the  cognitive  areas  may  be  classified. 
The major categories of the Taxonomy, in increasing degrees of complexity,
are (a) knowledge, (b) comprehension, (c) application, (d) analysis,
(e) synthesis, and (f) evaluation.

Stoker  and  Kropp  (1964)  report  general  support  for  the  hierarchical 
structure  of  the  cognitive  process  if  evaluation  is  placed  before 
synthesis.  Additional  support  for  Bloom's  notion  of  hierarchical 
structure is provided by Ayers (1966). The results from a factor analytic
study by Ayers are in general agreement with a hierarchical nature but
there is some question as to whether or not the same factors and
hierarchical order, as that presented by Bloom, will be confirmed.

Suggestions  for  preparing  good  test  items  can  be  found  in  several 
books  (Lindquist,  1951;  Thorndike  and  Hagen,  1955;  Ebel,  1965).  A  list 
of  additional  references  for  item  writing  in  various  subject  fields  has 
been  prepared  by  Adams  and  Torgerson  (1964,  pp.  396  -  399). 

Test  items  have  frequently  been  dichotomized  into  essay  test 
items  and  objective  test  items.  In  this  setting,  essay  is  intended  to 
include other supply-type test items such as completion questions.
Objective items, which can be thought of as choice-type instead of
supply-type, can be subdivided into true-false items, multiple-choice items and
matching  exercises.  "There  is  a  growing  recognition  that  many  of  the 
criticisms  of  both  approaches  are  not  necessarily  inherent  but  grow  out 
of  ineffectiveness  in  their  application"  (Adams  and  Torgerson,  1964,  p.  332). 

In  the  essay  test,  a  few  questions  or  problems  are  presented  and 
students  are  asked  to  supply  the  answers.  A  large  number  of  questions, 
with a limited number of alternative answers for each, are used in objective
tests.  It  is  comparatively  easy  to  construct  an  essay  test  but  difficult 
to  grade  for  more  than  a  few  students.  The  multiple-choice  test  is 
relatively  more  difficult  to  construct  but  can  be  graded  easily  for 
many  students.  An  essay  test  is  usually  less  reliable  than  a  multiple- 
choice  test  because  of  the  minimal  sampling  of  content  and  variability 
in  scoring  of  questions.  Although  well-constructed  multiple-choice 
tests  are  accepted  as  effective  measurement  instruments  (Ebel,  1965), 
they are often criticized as measuring only the simple facts of subject
matter and thus providing no evidence regarding command of cognitive
abilities of greater complexity. Also, multiple-choice tests are regarded
by  critics  (Hoffman,  1962)  as  being  only  a  measure  of  memory  rather  than 
understanding.  Essay  tests  can,  it  is  maintained,  be  used  to  allow  a 
student  to  demonstrate  his  ability  to  organize  and  present  a  creative 
answer.  Rather  than  try  to  decide  whether  multiple-choice  examinations 
are  generally  better  than  tests  of  the  essay  type,  or  vice  versa,  it 
would  be  more  appropriate  to  see  how  they  both  can  be  made  as  effective 
as  possible  and  how  they  can  be  used  to  complement  one  another.  Ebel 
(1965,  pp.  109  -  110)  has  outlined  how  essay  and  objective  tests  are 
useful  for  different  purposes  and  in  different  situations. 

The  use  of  multiple-choice  items,  where  an  item  is  scored  either 
1  or  0,  introduces  the  problem  of  what  level  and  distribution  of 
difficulties  are  appropriate  for  the  questions  included  in  the  test.  One 
answer  is  to  include  only  questions  that  most  students  should,  in  the 
teacher's  opinion,  be  able  to  answer.  If  this  is  done,  many  students 
will  answer  most  questions  correctly  resulting  in  poor  discrimination 
among students on level of achievement. Another alternative is to use
items on which approximately half the students are successful. This
approach will contribute the most information as to relative levels of
achievement among the students tested. When the difficulty level of
an item, $p$, is .5, the maximum possible item variance, $s^2$, is
obtained by $s^2 = pq$, where $q = 1 - p$. Departures from $p = .5$ will
result in a decreased item variance. Although departures from $p = .5$ may
yield more reliable scores for the same amount of testing time, an optimal
psychometric situation where $p = .5$ may prove to be more worrisome to
the students. When $p = .5$, half of the students will fail any item,
resulting in a mean score of only 50 percent. It should be noted
that $p$, an average item score, is also an average index of item
difficulty for individuals. Coombs (1950) has commented on the fact
that the difficulty of an item varies for different individuals. The
index does not yield accurate information concerning the item's
difficulty for a given individual.
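
The relationship between item difficulty and item variance for items scored 1 or 0 may be sketched as follows. This is a minimal illustration only: the function name `item_statistics` and the small response matrix are hypothetical and do not come from the study, and difficulty is taken as the proportion of examinees answering the item correctly.

```python
def item_statistics(responses):
    """Return (p, p*q) for each item in a matrix of 0/1 item scores.

    responses: list of rows, one row per examinee, one column per item.
    p is the proportion answering correctly; q = 1 - p; p*q is the
    variance of a dichotomously scored item.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    stats = []
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_examinees
        q = 1 - p
        stats.append((p, p * q))
    return stats


# Hypothetical data: four examinees, three items.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
]

stats = item_statistics(responses)
for number, (p, variance) in enumerate(stats, start=1):
    print(f"item {number}: p = {p:.2f}, variance = {variance:.4f}")
```

In this hypothetical matrix the second item, answered correctly by half of the examinees, has the largest variance (.25), while the first item, answered correctly by everyone, has a variance of zero and therefore cannot discriminate among examinees.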

"There  is  no  formula  for  determining  the  exact  distribution  of 

item difficulties" (Freeman, 1955, p. 39). The determination of the
optimum difficulty of the test items to be used in a test is a problem
on which there is not complete agreement among test specialists. Some
test authorities prefer approximately equal numbers of items at all
levels arranged from very easy to very difficult (Remmers and Gage,
1955; Nunnally, 1959); others prefer to have the majority of items near
the 50 percent difficulty level. Richardson, for example, found that

...  a  test  composed  of  items  of  50  percent  difficulty 
has  a  general  validity  which  is  higher  than  tests  composed 
of  items  of  any  other  degree  of  difficulty  (Richardson,  1936, 
p.  47). 




Gulliksen,  in  a  theoretical  analysis,  concluded  that 

In  order  to  maximize  the  reliability  and  variance  of  a 
test  the  items  should  have  high  intercorrelations,  all 
items  should  be  of  the  same  difficulty  level,  and  the 
level  should  be  as  near  50  percent  as  possible 
(Gulliksen,  1945,  p.  79). 

In spite of the fact that the maximum item-criterion correlation
occurs for items of 50 percent difficulty, another special level of
difficulty may prove to be valuable in a particular situation. When
items have low intercorrelations, a distribution of item difficulties
clustered around the 50 percent level often approximates the
distribution required to obtain maximum discrimination throughout the range
of scores. The distribution of difficulty indices should be made more
platykurtic or rectangular than usual if equal accuracy of measurement
and discrimination are desired throughout the range of scores for
items with relatively high intercorrelations. An extended discussion
of the above statements has been prepared by Brogden (1946). When
selecting  a  specific  group  of  subjects,  Lord  (1953)  suggests  that  the 
average  item  difficulty  should  match  the  selection  ratio.  If  the  top 
30  percent  of  persons  were  to  be  selected,  the  most  efficient  test  would 
be  that  for  which  the  average  item  difficulty  is  at  30  percent. 

The  general  procedure  in  common  practice  "in  the  arrangement 
of  standardized  test  items  tends  to  follow  the  procedure  of  presenting 
items  covering  a  wide  range  of  difficulties  in  ascending  order  from  the 
very  easy  to  the  most  difficult"  (Greene,  Jorgensen  and  Gerberich,  1954, 
p.  91). 

Apart  from  statistical  decisions  the  onus  is  on  the  test 
constructor to select the desired level or levels of item difficulty, to
suggest  whether  or  not  a  weighting  of  items  is  necessary,  to  decide 
whether  or  not  to  correct  for  guessing,  to  accept  or  reject  a  given 
level  of  reliability  and/or  validity  coefficient,  and  to  make  decisions 
on  a  host  of  other  major  considerations  in  preparing  a  test.  No 
present  statistical  technique  can  replace  the  judgement  of  the 
subject  matter  expert  in  the  selection  and  rejection  of  items  to 
sample  representative  content  domains  and  educational  objectives. 

Quantitative Criteria. In developing a test, consideration is
given to the statistical properties of each individual item obtained
through item analysis. When the decisions
have  been  made  by  "experts"  on  cut-off  points,  it  is  relatively  easy 
to  select  items  with  the  desired  properties.  The  items  selected,  on 
the  basis  of  item  characteristics,  can  now  be  used  to  form  an  item 
pool.  It  would  be  desirable  to  select  a  sample  of  items  from  the 
item  pool  that  would  result  in  a  desired  mean,  variance,  skewness, 
kurtosis  and  distribution  shape.  This  is,  as  yet,  not  possible.  The 
most  frequently  used  procedure  at  present  for  constructing  a  test  is 
based  upon  the  results  obtained  from  an  item  analysis. 

In  many  situations  item  pools  are  constructed  by  selecting 
items  on  the  basis  of  an  item  analysis  of  several  tests  administered 
to  different  groups  of  subjects.  Tests  are  subsequently  constructed 
by  selecting  items  on  the  basis  of  the  initial  item  analysis  used  to 
construct  the  item  pool.  However,  it  must  be  noted  that  item  analysis 
results  for  an  item  are  always  specific  to  the  particular  group  and 
the particular subset of items into which the item is embedded. Thus,
the item analysis results for a sample of items from the item pool might well
differ from those used to construct the pool of items. On this basis
therefore,  it  seems  reasonable  that  whenever  possible  the  initial  item 
pool  should  be  re-evaluated  in  terms  of  the  user's  needs. 




CHAPTER  IV 


ASSESSMENT  OF  TESTS 

Whether a test is "hand made" or developed according to some
analytical criterion, all tests must be subjected to the same criteria
for their evaluation. A multitude of approaches is available for
assessment of tests.

The  most  recent  reference  that  provides  a  general  consensus 
by  authorities  in  the  field  of  measurement  regarding  what  and  how  to 
evaluate tests is Standards for Educational and Psychological Tests
and Manuals (Standards) (1966). Although the presentation in the
Standards is very brief and must therefore be supplemented with material
from other publications, it should be considered as an authoritative
voice in deciding what is relevant or nonrelevant in evaluating a test.
As an aid to test development, the Standards provide a kind of checklist
of factors to be considered in designing the standardization and
validation of tests. The main topics covered are: (a) dissemination of
information, (b) interpretation, (c) validity, (d) reliability,
(e) administration and scoring, and (f) scales and norms.

Validity 

Test  validity  is  concerned  with  what  a  test  measures  and  how 
well it does so. Validity is a complex concept that has been
interpreted in various ways by different writers. Many types of validity
and their general classifications have been described. Thorndike and
Hagen (1955) suggest a dichotomy of types of validity: validity which
depends  primarily  upon  rational  analysis  and  professional  judgement,  and 
that  which  depends  upon  empirical  and  statistical  evidence.  The 
dichotomy,  similar  to  that  above,  proposed  by  Ebel  (1965)  is,  respectively, 
concerned  with  primary  or  direct  validity  as  contrasted  with  secondary 
or  derived  validity. 

Some  types  of  validity,  reviewed  by  Ebel  (1965),  that  seem 
appropriate  for  each  category  are  listed  below: 


Direct                      Derived

Validity by definition      Empirical Validity
Content Validity            Concurrent Validity
Curricular Validity         Predictive Validity
Intrinsic Validity          Factorial Validity
Face Validity               Construct Validity

The  distinction  between  the  two  categories  is  not  explicit  nor  clearly 
defined  since  factorial  validity  and  construct  validity,  "despite  their 
involvement  of  multiple  measurements  and  coefficients  of  correlation, 
do  represent  a  basic  (primary)  kind  of  validity"  (Ebel,  1965,  pp.  381- 
382).  A  standard  reference,  Technical  Recommendations  for  Psychological 
Tests and Diagnostic Techniques (1954), has the various types of
validity classified under four categories, designated as content,
predictive, concurrent, and construct validity. These four aspects of
validity have been used as a basis for developing more elaborate
subclassifications.

In  Standards  for  Educational  and  Psychological  Tests  and  Manuals 
(1966), a revision of two documents: (a) Technical Recommendations for
Psychological  Tests  and  Diagnostic  Techniques  (1954)  and  (b)  Technical 
Recommendations for Achievement Tests (1955), three kinds of validity
coefficients are distinguished. The three aspects of validity
corresponding to three aims of testing may be designated as follows:

1.  Content  Validity  -  The  test  user  wishes  to  determine  how 
an  individual  performs  at  present  in  a  universe  of 
situations  that  the  test  situation  is  claimed  to 
represent.

2.  Criterion-Related  Validity  -  The  test  user  wishes  to 
forecast  an  individual’s  present  standing  on  some 
variable  of  particular  significance  that  is  different 
from  the  test. 

3.  Construct  Validity  -  The  test  user  wishes  to  infer  the 
degree to which the individual possesses some hypothetical
trait or quality (construct) presumed to be
reflected  in  the  test  performance.  (Standards  for 
Educational  and  Psychological  Tests  and  Manuals,  1966, 
p.  12) 

"Probably  the  most  sophisticated  form  of  content  validity  is  that 
which makes use of the technique called factor analysis" (Helmstadter,
1964,  p.  92).  In  like  manner,  Guilford  maintains  that  "the  best  answer 
to  the  question,  "What  does  this  test  measure?"  is  in  the  form  of  a  list 
of  primary  factors  with  which  it  correlates  and  their  proportions  of 
variance  in  the  test"  (Guilford,  1965,  p.  472).  The  above  validity 
estimate  is  known  as  factorial  validity.  According  to  Guilford  (1965) 
this  type  of  validity  is  basic  to  the  understanding  of  other  kinds  of 
validity  and  of  many  phenomena  of  correlation  in  general. 

Whereas  predictive  and  concurrent  validation  are  judged  for  a 
test  by  a  statistical  study  of  results,  content  validity  is  established 
by logical examination of the test and the methods used in its
preparation.  Subjective  judgement,  "be  it  termed  professional  judgement, 
common  sense,  or  'expertese',  is  involved  in  all  phases  of  content 
validity  and  is  its  paramount  characteristic"  (Ghiselli,  1964,  p.  345). 

Although  subjective  judgement  and  factorial  validity  seem  to 
represent,  respectively,  an  evaluative  position  based  on  personal 
opinion  versus  an  objective  statistical  solution,  subjective  judgement 
plays  a  prominent  role  in  factorial  validity.  The  postulated  constructs 
represented  by  each  of  the  factors  resulting  from  a  factor  analysis  are 
defined,  in  the  main,  by  persons  familiar  with  the  variables  used  in  the 
particular  analysis  being  considered. 

Emphasis  has  been  given  to  content  validity  because  of  its  basic 
position  in  all  measurement  problems.  Since  test  questions  are  only  a 
sample  of  all  possible  questions  that  might  be  asked,  items  may  or  may 
not be representative of the total domain of appropriate questions. In
an ideal situation, a test constructor should define a subset of the
universe to be studied (e.g., an outline of the course content should
be used in preparing an achievement test), from which a sample of items
is selected to represent the content. Test developers should exercise
great care to match their achievement tests to the course of study.
Item sampling is sometimes very poor in tests constructed by an
inexperienced or untrained tester.

Content  validity  requires  judging  whether  each  item,  and  the 
distribution  of  items  as  a  whole,  covers  the  subject  matter  of  interest 
to  the  tester.  The  decision  to  accept  or  reject  an  item,  on  the  basis 
of its content, remains with the test user rather than the test
constructor. Although the test constructor can state the source of his
items,  they  will  rarely  correspond  perfectly  to  what  the  tester  requires. 
Thus,  it  would  appear  that  content  validity  is  one  type  of  validity 
with  which  we  should  be  deeply  concerned.  The  assumptions  underlying 
the  use  of  content  validity  have  been  summarized  by  Lennon  (1956). 

Two  approaches  are  used  in  calculating  a  criterion-related 
validity coefficient. The procedure is essentially a measure of
statistical  relationship  between  test  scores  and  one  or  more  external 
variables  considered  to  provide  a  direct  measure  of  the  characteristic 
or  behaviour  being  evaluated.  If  a  test  is  to  be  used  for  assessment  of 
present  status,  the  criterion  data  should  be  collected  concurrently  with 
the  testing.  For  predictive  purposes,  the  criterion  data  would  usually 
be  collected  at  a  later  time. 
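
A criterion-related validity coefficient of the kind described above may be sketched in Python. The sketch assumes that the "statistical relationship" is expressed as a Pearson product-moment correlation between test scores and a single external criterion; the text does not prescribe a particular coefficient. The function name `pearson_r` and both sets of scores are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical data: test scores and an external criterion measure
# for the same five examinees.
test_scores = [10, 12, 15, 18, 20]
criterion = [52, 55, 61, 70, 72]

validity = pearson_r(test_scores, criterion)
print(f"criterion-related validity coefficient: {validity:.3f}")
```

Whether the coefficient is read as concurrent or predictive depends only on when the criterion data are collected, not on the computation itself.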

Cronbach  and  Meehl  (1955)  presented  the  notion  of  construct 
validity  which  has  been  formally  adopted  by  the  American  Psychological 
Association,  the  American  Educational  Research  Association  and  the 
National  Council  on  Measurement  in  Education'*'.  A  combination  of 
logical  and  empirical  attack  is  required  in  gathering  data  to  examine 
construct  validity.  Although  construct  validity,  as  a  concept,  appears 
to  be  fully  acceptable  to  many  authoritative  psychometricians,  Horst 
maintains  that  "it  is  very  difficult  to  incorporate  it  (construct 
validity)  in  or  integrate  it  with  a  logical  and  practical  theory  of 
measurement"  (1966,  p.  346).  While  there  may  be  problems  associated 
with  using  the  concept  of  construct  validity  in  measurement  theory,  the 

¹See Standards for Educational and Psychological Tests and Manuals (1966).

general  consensus  appears  to  be  that  of  retaining  the  term  and  the 
theoretical framework upon which the notion rests.

The emphasis in the definition of validity is upon what is
being measured. It must be emphasized that there is no one measure of
validity. A test or scale is valid for the particular scientific or
practical purpose of its user. Thus, different types of investigation
are required to establish the validities when several types of criteria
are involved. The procedure for establishing criterion-related validity
differs from the approach used to determine construct validity which in
turn differs from how content validity is established for a test. When
assessing the validity of a test, the question "Valid for what?" should
be answered.

Reliability 

"The  reliability  of  any  set  of  measures  is  logically  defined  as 
the  proportion  of  their  variance  that  is  true  variance"  (Guilford,  1965, 
p. 439), whereas the index of reliability is the "correlation between
true  and  observed  scores"  (Gulliksen,  1950  a,  p.  22).  When  reliability 
is  defined  as  the  ratio  of  the  true  score  variance  to  observed-score 
variance  in  the  population,  the  ratio  is  sometimes  known  as  an  intra¬ 
class  correlation. 

Traditionally,  reliability,  a  generic  term  referring  to  many 
types  of  evidence,  is  concerned  with  the  question  "How  consistently 
does a test measure?" Several approaches to score consistency result
in several types of reliability coefficients. Not all types answer
the same questions. As a result of inconsistency in terminology used
by  researchers  and  vague  definitions  of  terms,  a  joint  committee  of  the 
American  Psychological  Association,  American  Educational  Research 
Association, and National Council of Measurements Used in Education
prepared a publication entitled Technical Recommendations for Psychological
Tests and Diagnostic Techniques (1954). An attempt was made to
standardize  and  classify  the  various  types  of  reliability.  The  three 
main subclassifications are as follows: (a) a measure based on internal
analysis of data obtained on a single trial of a test is to be known as
a coefficient of internal consistency, the most prominent of these being
the analysis of variance method (Kuder and Richardson (1937) and Hoyt
(1941)) and the split-half method; (b) a coefficient of equivalence is
obtained by calculating a correlation between scores from two forms given
at essentially the same time; and (c) the correlation between test and
retest, with an intervening period of time, is a coefficient of stability.
The  latter  procedure  may  be  used  with  parallel  forms  of  a  test  or  a  second 
administration  of  the  same  test  after  an  intervening  period  of  time. 

Cronbach  (1951),  using  one  of  the  reliability  formulas  that  was 
derived  from  a  more  general  theoretical  approach  by  Kuder  and  Richardson 
(1937), designated a particular reliability index as "coefficient α"
which would replace the name, "Kuder-Richardson formula number 20", now
commonly used. Attractive features of the formula used to calculate
coefficient α are that it yields the mean of the correlations resulting
from all possible ways of splitting a given test into two halves and that
it gives the proportion of first-factor variance extracted from the
intercorrelations of the test items. An additional feature of the formula
used to calculate coefficient α is that the formula is not restricted to
items  scored  0  and  1. 

The  reliability  of  a  test  is  often  referred  to  as  being  a 
measure  of  internal  consistency,  rather  than  a  temporal  (retest)  index, 
which  seems  to  follow  logically  from  classical  test  score  theory 
(Baggaley, 1964). This is further reflected in the observation that
"in  developing  the  vast  majority  of  tests  constructed  today  the  makers 
strive  towards  internal  consistency"  (Guilford,  1954,  p.  388). 

However,  while  Guilford  (1954)  maintains  that  reliability  is  the 
minimum  information  one  should  have  concerning  a  test,  he  further 
suggests  that  it  is  certainly  not  the  most  useful  information.  "It  is 
sometimes  said  that  reliability  is  important  because  it  contributes  to 
validity  and  that  validity  is  the  important  goal"  (Guilford,  1954,  p.  389). 
Thus,  "a  test  cannot  measure  more  accurately  what  it  is  intended  to 
measure than the accuracy with which it measures what it does measure.
Hence in order to be valid a test must be reliable" (Ebel, 1965, p. 389).
Therefore, to be concerned with test validity directly implies a
consideration for the reliability of a test. "Reliability is a necessary
condition  for  validity  in  an  educational  achievement  test,  but  it  is 
not  a  sufficient  condition"  (Ebel,  1965,  p.  309). 

In the present set of Standards, the reliability coefficients are
not classified into several types as in the Technical Recommendations.
The explanation given for this move is that the "terminological system
breaks  down  as  more  adequate  statistical  analyses  are  applied  and 
methods  are  more  adequately  described"  (Standards  for  Educational  and 
Psychological  Tests  and  Manuals,  1966,  p.  26).  It  is  recommended  in  the 
Standards  that  test  authors  work  out  suitable  phrases  to  convey  the 
meaning  of  whatever  coefficient  they  report.  The  rationale  for  the 
presentation of a descriptive rather than a categorized type of reliability
coefficient is that different methods take account of different
sources of error, which, when clearly labeled, are the most informative
outcome of a reliability study. If this approach is used, it is
imperative that the method used to derive the reliability coefficient
be clearly described. The impetus for this trend appears to have
resulted from suggestions made by Cronbach et al. (1963) in Theory
of Generalizability: A Liberalization of Reliability Theory.

Interdependencies 

Factors  Affecting  Reliability  and  Validity.  Reliability  is 
dependent  upon  various  determining  factors,  such  as  speed  of  work, 
heterogeneity  of  subjects,  length  of  test,  difficulty  level  of  the 
items,  and  approach  used  to  estimate  reliability.  In  general, 
reliability  is  a  function  of  item  by  person  tested.  The  parallel 
form  estimate  of  reliability  is  often  considered  to  be  a  lower  bound 
because  it  includes  form  to  form  and  time  fluctuations  in  its  definition 
of  error.  For  the  above  reasons,  a  parallel  form  estimate  is  often  the 
preferred measure (Helmstadter, 1964). Split-half reliability is
usually  regarded  as  representing  the  upper  bound  of  the  true  reliability. 
This  is  especially  relevant  when  applied  to  tests  having  a  large  speed 
component. Homogeneous tests are likely to be more reliable than
heterogeneous tests whereas scores obtained from heterogeneous groups are
likely  to  be  more  reliable  than  scores  obtained  from  homogeneous  groups 
(Ebel,  1965).  As  the  length  of  a  test  is  increased,  the  reliability  of 
the  test  increases.  The  relationship  between  test  reliability  and 
test  length  is  expressed  by  the  generalized  Spearman-Brown  formula 
(Gulliksen,  1950  a).  "Contrary  to  popular  belief,  a  good  test  seldom 
needs  to  include  items  which  vary  in  difficulty"  (Ebel,  1965,  p.  339). 
When  items  have  a  difficulty  level  of  .5,  more  variable  scores  are 
obtained  from  a  test.  The  reliability  of  a  test  is  likely  to  be  higher 
when  there  is  a  maximum  score  variance  resulting  from  the  use  of  items 
having  difficulty  indices  near  .5. 

The  Kuder-Richardson  Formulas  for  estimating  the  reliability 
of a test, $r_{xx}$, depend upon item statistics. They were developed
because of dissatisfaction with split-half methods. The use of item
statistics removes such biases as may arise from arbitrary splitting
into halves. When an accurate and practical formula is required,
the reliability coefficient for a test is generally
estimated by using the Kuder-Richardson 20 (KR-20) formula. The
relationship  of  item  analysis  data  to  reliability  may,  perhaps,  most 
clearly  be  demonstrated  by  means  of  the  following  equation.  The 
expression is

$$ r_{xx} = \frac{K}{K-1}\left[1 - \frac{\sum_{g=1}^{K} S_g^2}{\left(\sum_{g=1}^{K} r_{xg} S_g\right)^2}\right] $$

where K is the number of items in the test,
$S_g^2$ is the item variance, which equals $p_g - p_g^2$,
$p_g$ is the difficulty of item g,
$r_{xg} S_g$ is the item reliability index, and
$r_{xx}$ is the reliability of the total test.

Item "g" is an item in test "x". Although the KR-20 formula yields
accurate results, considerable work is required in calculating $r_{xx}$.
The most common modified KR formula proposed is that known as the
KR-21 formula. If the item difficulties are very nearly equal, the
KR-21 formula will provide a quick estimate of the lower bound of $r_{xx}$.
This  formula  only  requires  information  regarding  the  test  mean, 
variance  of  the  raw  scores  and  the  number  of  items  in  the  test.  The 
estimate  obtained  by  using  KR-21  is  generally  lower  than  that 
calculated  by  using  formula  KR-20  whereas  the  odd-even  estimate  will 
generally  be  higher  than  the  KR-20  value. 
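
The two Kuder-Richardson estimates discussed above can be illustrated with a short computational sketch. It is offered only as an illustration under stated assumptions: the function names `kr20` and `kr21` and the small response matrix are hypothetical, population (divide-by-N) variances are assumed throughout, and the KR-21 expression is the commonly cited form based on the test mean, the raw-score variance and the number of items, since that formula is not written out in the text. Because the sum of the item reliability indices equals the standard deviation of the total scores, the total-score variance is used directly in the KR-20 computation.

```python
from statistics import mean, pvariance


def kr20(scores):
    """KR-20 for a persons-by-items matrix of 0/1 scores (population variances)."""
    k = len(scores[0])                       # number of items, K
    totals = [sum(row) for row in scores]    # total score for each person
    var_total = pvariance(totals)            # S_x^2, equal to (sum of r_xg * S_g)^2
    item_vars = []
    for g in range(k):
        p = mean([row[g] for row in scores])  # difficulty p_g of item g
        item_vars.append(p - p * p)           # item variance p_g - p_g^2
    return (k / (k - 1)) * (1 - sum(item_vars) / var_total)


def kr21(scores):
    """KR-21 (assumed standard form): needs only K, the mean and the variance."""
    k = len(scores[0])
    totals = [sum(row) for row in scores]
    m = mean(totals)
    var_total = pvariance(totals)
    return (k / (k - 1)) * (1 - m * (k - m) / (k * var_total))


# Hypothetical data: six persons answering four items scored 0 or 1.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]

print("KR-20:", kr20(scores))
print("KR-21:", kr21(scores))
```

For this small matrix the KR-21 value does not exceed the KR-20 value, which agrees with the statement that KR-21 generally gives the lower estimate.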

The corresponding general formula for validity is presented
below. We have

$$ r_{xy} = \frac{\sum_{g} r_{yg} S_g}{\sum_{g} r_{xg} S_g} $$

where $r_{yg}$ is the point biserial correlation of item g with the criterion y,
$r_{xg}$ is the point biserial correlation of item g with the test x, and
$r_{xy}$ is the correlation between the criterion and test (Gulliksen,
1950 a).
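
The relationship between the item indices and test validity can be checked numerically. The sketch below is an illustration only: the helper names (`pop_sd`, `pearson`, `validity_from_items`), the response matrix and the criterion scores are hypothetical, and population standard deviations are assumed. It computes the test-criterion correlation directly and then again as the ratio of the summed item validity indices to the summed item reliability indices.

```python
from math import sqrt


def pop_sd(values):
    """Population (divide-by-N) standard deviation."""
    m = sum(values) / len(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))


def pearson(u, v):
    """Product-moment correlation; with a 0/1 item this is the point biserial."""
    mu = sum(u) / len(u)
    mv = sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)
    return cov / (pop_sd(u) * pop_sd(v))


def validity_from_items(scores, criterion):
    """Ratio of summed validity indices to summed reliability indices."""
    totals = [sum(row) for row in scores]
    numerator = 0.0
    denominator = 0.0
    for g in range(len(scores[0])):
        item = [row[g] for row in scores]
        s_g = pop_sd(item)
        numerator += pearson(item, criterion) * s_g   # item validity index
        denominator += pearson(item, totals) * s_g    # item reliability index
    return numerator / denominator


# Hypothetical data: six persons, four 0/1 items, one external criterion score.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
criterion = [9, 7, 6, 4, 2, 8]

totals = [sum(row) for row in scores]
direct = pearson(totals, criterion)
from_items = validity_from_items(scores, criterion)
print(direct, from_items)
```

The two values agree to within rounding error, since the numerator sums to the test-criterion covariance divided by the criterion standard deviation and the denominator sums to the test standard deviation.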

Since  transformations  of  test  scores  can  be  used  to  obtain 
scores  with  a  specified  mean,  variance  and,  within  certain  limits, 
the  form  of  score  distribution,  it  is  suggested  that  test  construction 
procedures  may  profit  when  an  emphasis  is  placed  upon  producing 
reliable  and  valid  tests.  An  attempt  to  produce  a  single  test  that  is 
both highly reliable (in the internal consistency sense) and also
highly valid is truly a meritorious task. Unfortunately the two goals
are incompatible in some respects. The requirements, as outlined by
Guilford, for maximal reliability and predictive validity are as follows:

Maximal  reliability  (internal  consistency  type)  requires 
high  intercorrelation  among  items;  maximal  predictive 
validity  requires  low  intercorrelations.  Maximal 
reliability  requires  items  of  equal  difficulty;  maximal 
predictive  validity  requires  items  differing  in  difficulty 
(Guilford,  1965,  p.  481) 

Thus,  there  must  be  some  compromising  of  aims  since  both  reliability 
and  validity  cannot  be  maximal  especially  when  there  is  a  restriction  on 
the  number  of  items  used  to  construct  a  test.  An  optimal  situation  may 
be  to  treat  both  properties  with  equal  emphasis.  However,  to  "err  on 
the  side  of  (high)  validity,  which  after  all,  is  the  more  important" 
(Guilford,  1965,  p.  481)  will  probably  lead  to  the  construction  of  a 
highly  acceptable  test. 

The  number  of  measurements  (items)  used  in  constructing  a  test 
will  influence  results  calculated  from  the  data.  Gulliksen  has  shown 
that  "increasing  the  length  of  a  test  K  times  multiplies  the  mean  by  K, 
provided  that  each  of  the  new  parts  is  parallel  to  the  original" 
(Gulliksen,  1950  a,  p.  69).  Lengthening  a  test  K  times  increases  the 
variance of gross scores as indicated in the following equation

$$ S_c^2 = S_1^2 K \left[1 + (K - 1) r_{12}\right] $$

where
$S_1^2$ is the variance of the unit length test,
K is the ratio of the number of items in the new test to the
number in the unit length test,
$r_{12}$ is the correlation between the two parts, and
$S_c^2$ is the variance of the lengthened test.

Similarly,

$$ S_c = S_1 \sqrt{K + K (K - 1) r_{11}} $$

is the formula relating the increased length of a test to its standard
deviation ($S_c$) where $r_{11}$ is the reliability of the unit test (Gulliksen,
1950 a).


The effect of test length on reliability is given by

$$ R_{KK} = \frac{K r_{11}}{1 + (K - 1) r_{11}} $$

which is known as the general Spearman-Brown formula where $R_{KK}$ is the
reliability of the lengthened test. The relationship between test length
and its validity is given by

$$ R_{Ky} = \frac{\sqrt{K}\, r_{1y}}{\sqrt{1 + (K - 1) r_{11}}} $$

where $r_{1y}$ is the validity coefficient of the unit test and $R_{Ky}$ is the
augmented validity coefficient. As the test length is increased,
reliabilities approach unity. However, in contrast to reliability
"the validity coefficient is usually considerably smaller than the test
reliability (which) usually means that changing the length of a test
can be expected to have only a very slight effect on the validity of
the test" (Gulliksen, 1950 a, p. 90).
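
The effects of test length described above can be explored with a brief sketch. The function names and the numerical values are hypothetical, and the validity relation is assumed to take the standard form in which the unit validity is multiplied by the square root of K and divided by the square root of 1 + (K - 1) times the unit reliability.

```python
from math import sqrt


def spearman_brown(r11, k):
    """Reliability of a test lengthened k times (general Spearman-Brown formula)."""
    return k * r11 / (1 + (k - 1) * r11)


def lengthened_sd(s1, r11, k):
    """Standard deviation of a test lengthened k times with parallel parts."""
    return s1 * sqrt(k + k * (k - 1) * r11)


def lengthened_validity(r1y, r11, k):
    """Validity of a test lengthened k times (assumed standard form)."""
    return sqrt(k) * r1y / sqrt(1 + (k - 1) * r11)


# Hypothetical unit test: reliability .50, validity .30, standard deviation 1.0.
for k in (1, 2, 4, 8):
    print(
        k,
        round(spearman_brown(0.5, k), 3),
        round(lengthened_validity(0.3, 0.5, k), 3),
        round(lengthened_sd(1.0, 0.5, k), 3),
    )
```

Under these assumptions the reliability climbs steadily toward unity as K grows, while the validity cannot exceed the unit validity divided by the square root of the unit reliability.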

In  a  discussion  of  reliability  and  validity,  Rozeboom  (1966, 
p.  422)  says  that  "it  is  debatable  whether  the  practical  benefits  of 
reliability  theory  are  sufficiently  bountiful  to  recompense  the  labour 
that  test  theorists  have  invested  in  it.  The  primary  justification 
of  reliability  theory  lies  in  abstract  curiosity".  The  one  useful 
aspect  of  the  test's  reliability  index  is  that  it  is  an  upper  limit  to 
its  validity.  Thus,  information  about  test  reliability  is  especially 
important  when  data  are  not  available  on  a  test's  validity  for  its 
intended  purpose.  When  validity  estimates  are  available,  the  test's 
reliability  is  a  matter  of  indifference  (Rozeboom,  1966). 

The  problem  of  obtaining  a  suitable  criterion  arises  whenever 
a  prediction  is  to  be  made.  At  times  samples  are  selected  that  have  a 
marked  restriction  of  range  on  the  resulting  test  scores.  A  failure  to 
cross-validate  can  lead  to  exaggerated  claims  as  to  the  effectiveness 
of  the  prediction  or  selection.  Apart  from  the  criterion  problem,  there 
are  the  issues  of  guessing  and  faking,  and  response  sets. 

The  Criterion.  "The  so-called  criterion  problem  refers  to  the 
fact  that  in  many  cases  it  is  extremely  difficult  to  obtain  adequate 
evidence  for  the  validity  of  a  test  because  no  criterion  appears  to  be 
completely satisfactory" (Helmstadter, 1964, p. 145). Since the concept
of  predictive  validity  involves  the  correlation  of  a  psychological 
measure  with  a  special  kind  of  measure  called  the  criterion,  one  of  the 
first  tasks  is  to  define  a  conceptual  criterion  by  means  of  verbal 
statements  from  which  a  criterion  measure  is  developed  that  is  stated  in 
operational  terms.  "The  only  method  for  "validating"  a  criterion  measure 
is  a  logical  analysis  of  its  relevance  to  the  conceptual  criterion" 
(Astin,  1964,  p.  811). 

An  outline  of  the  problems  involved  in  the  use  of  criterion 
measures  has  been  subdivided  into  three  general  categories  as  follows: 

1.  The  Nature  and  Role  of  the  Criterion  (definitions, 
common  fallacies  about  criteria,  and  certain  logical 
and  technical  considerations  in  developing  criterion 
measures).

2. Criteria and Test Development (the function of
criteria in the construction and validation of tests).

3.  Criterion-Centered  Research  versus  Construct 
Validity  (similarities  and  differences  between  the 
two  approaches,  and  the  case  for  criterion- 
centered  research).  (Astin,  1964,  p.  807) 

Cureton (1951, pp. 626 - 674) and Horst (1966, pp. 334 - 347) have
directly related the criterion problem to test validity in detailed
presentations both entitled Validity.

When  a  test  is  being  constructed,  the  ends  desired  in  an  applied 
setting  should  first  be  established  by  defining  those  ends  in  terms  of 
a  set  of  criteria.  Thus,  specification  of  conceptual  criteria  and  some 
attempt  at  criterion  development  appear  to  be  important  preliminaries  to 
the  construction  of  any  test  which  is  designated  for  applied  use. 

Item Analysis. "The major goals of item analysis are the
improvement  of  total-score  reliability  or  of  total-score  validity,  or 
both,  and  the  achievement  of  better  item  sequences  and  types  of  score 
distributions"  (Guilford,  1965,  p.  493).  The  commonly  used  descriptive 
statistics  for  item  parameters  are: 

1. The proportion of persons answering each item correctly.
This quantity is a measure of item difficulty.

2. The reliability index, which is the point-biserial
correlation between item and total score multiplied by
the item standard deviation. A reliability index is not
equivalent to the index of reliability.

3. The validity index, which is the point-biserial correlation
between item and criterion score multiplied by the item
standard deviation. (Gulliksen, 1950 a, p. 385)
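
The three item parameters listed above can be computed as in the following sketch. The helper names, the response matrix and the criterion scores are hypothetical, and population standard deviations are assumed; the sketch is meant only to make the definitions concrete.

```python
from math import sqrt


def pop_sd(values):
    """Population (divide-by-N) standard deviation."""
    m = sum(values) / len(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))


def pearson(u, v):
    """Product-moment correlation; with a 0/1 item this is the point biserial."""
    mu = sum(u) / len(u)
    mv = sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)
    return cov / (pop_sd(u) * pop_sd(v))


def item_statistics(scores, criterion):
    """Difficulty, reliability index and validity index for each item."""
    totals = [sum(row) for row in scores]
    stats = []
    for g in range(len(scores[0])):
        item = [row[g] for row in scores]
        s_g = pop_sd(item)  # item standard deviation
        stats.append({
            "difficulty": sum(item) / len(item),               # proportion correct
            "reliability_index": pearson(item, totals) * s_g,  # r_xg * S_g
            "validity_index": pearson(item, criterion) * s_g,  # r_yg * S_g
        })
    return stats


# Hypothetical data: six persons, four 0/1 items, one external criterion score.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
criterion = [9, 7, 6, 4, 2, 8]
totals = [sum(row) for row in scores]

stats = item_statistics(scores, criterion)
for g, s in enumerate(stats, start=1):
    print(g, s)
```

A useful check on such a computation is that the reliability indices sum to the standard deviation of the total scores.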

An  item  analysis  essentially  provides  two  kinds  of  information.  It 
provides  an  index  of  item  difficulty  and  an  index  of  validity,  where  the 
term  validity  is  used  in  a  very  broad  sense.  These  indices  may  show 
how  well  the  item  discriminates  in  agreement  with  the  rest  of  the  test, 
generally  the  total  test  score  as  an  internal  criterion,  or  how  well  it 
predicts  some  external  criterion.  Item  validity  is  thus  a  case  of  construct 
validity  when  the  criterion  is  the  total  score  and  predictive  validity 
when  one  uses  an  external  criterion.  The  homogeneity  (internal  consistency) 
of  a  test  is  increased  when  items  are  selected  which  correlate  highly 
with  total  score. 

Short-cut  methods  of  estimating  these  parameters  from  a  portion 
of the data have been presented by several writers. Kelley (1939)
suggested  that  two  special  criterion  groups  be  formed:  an  upper  group, 
consisting  of  27  percent  of  the  total  group,  who  received  the  highest 
total  test  scores  and  a  lower  group  consisting  of  an  equal  number  from 
those  who  received  lowest  scores.  Item  analysis  data  would  be  calculated 
from  this  portion  of  the  data.  Graphic  procedures  may  be  used  to 
calculate  item  difficulty  and  item  discrimination  indices.  Guilford  (1954) 
and  Helmstadter  (1964)  provide  a  detailed  explanation  of  these  techniques. 

However,  with  the  increasing  use  of  modern  electronic  machines, 
short-cut  methods  will  become  increasingly  less  desirable  since  the 
computational  labour  is  reduced  considerably  and  more  information  is 
provided  by  using  all  responses  in  a  more  rigorous  procedure. 
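
The upper and lower group procedure described above can be sketched as follows. The function name, the response matrix and the choice of index are assumptions made for illustration: the index used here is the simple difference between the proportions passing the item in the upper and lower 27 percent groups, which is only one of several indices that have been proposed, and ties in total score are broken by the original order of the persons.

```python
def discrimination_indices(scores, fraction=0.27):
    """Upper-minus-lower proportion correct for each item."""
    n_persons = len(scores)
    group_size = max(1, round(fraction * n_persons))  # persons per extreme group
    totals = [sum(row) for row in scores]
    # Rank persons from highest to lowest total score (stable for ties).
    order = sorted(range(n_persons), key=lambda i: totals[i], reverse=True)
    upper = order[:group_size]
    lower = order[-group_size:]
    indices = []
    for g in range(len(scores[0])):
        p_upper = sum(scores[i][g] for i in upper) / group_size
        p_lower = sum(scores[i][g] for i in lower) / group_size
        indices.append(p_upper - p_lower)
    return indices


# Hypothetical data: ten persons answering three items scored 0 or 1.
scores = [
    [1, 1, 1],
    [1, 1, 1],
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
    [0, 0, 0],
]

d = discrimination_indices(scores)
print(d)
```

Because only the two extreme groups enter the calculation, the responses of the middle group are ignored, which is the loss of information referred to above.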

The  difficulty  level  of  items  will  determine,  to  a  large  extent, 
the  shape  of  the  test  score  distribution.  If  a  multiple-choice  test 
is  used,  the  number  of  alternatives  used  for  each  item  usually  restricts 
the  range  of  probable  scores.  The  range  would  be  from  approximately 
20  percent  correct  answers  to  100  percent  correct  answers,  allowing  for 
random  guessing,  when  using  five  alternatives.  In  a  practical  test 
situation  this  would  undoubtedly  vary  since  the  alternative  selected  is 
not  generally  made  in  a  purely  random  fashion.  One  method  of  selecting 
items  to  result  in  a  wide  spread  of  test  scores  is  to  use  the  rule  of 
selecting  a  set  of  items  whose  average  difficulty  level  is  near  the 
middle  of  the  possible  score  range.  Although  "the  ideal  distribution  of 
difficulties  varies  in  terms  of  the  use  which  will  be  made  of  the  test 
and  the  intercorrelations  of  the  items  ...  a  good  general  procedure 
is  to  choose  approximately  an  equal  number  of  items  at  each  difficulty 
level  in  the  possible  score  range"  (Nunnally,  1959,  pp.  146  -  147). 

While  the  multiple  correlation  approach  is  generally  accepted 
as  being  the  superior  method  for  predicting  a  criterion  from  a  test 
composed  of  several  items,  in  order  to  maximize  validity,  the  procedure 
most  frequently  used  to  select  items  to  form  a  test  is  that  based  on 
item  analysis.  After  the  item  characteristics  have  been  determined, 
either  analytically  or  graphically,  from  the  empirical  tryout  of  the 
pool,  a  number  of  statistical  procedures  can  be  used  to  help  construct 
a  test.  "The  purpose  of  item  analysis  is  to  select  from  an  item  pool 
a  minimum  number  of  items  which  will  give  a  maximum  prediction  of  a 
criterion"  (Nunnally,  1959,  p.  144). 

A  guide  for  selecting  items  to  construct  a  test  is  to  use,  in 
general,  items  in  the  difficulty  range  of  .20  to  .80.  Items  more 
difficult  than  the  .20  level  would  not  likely  be  answered  by  many 
students  whereas  items  of  .80  difficulty  or  greater  may  be  so  easy 
that  one  is  only  adding  a  constant  to  the  individual’s  score  since 
nearly everyone receives credit for this question. Several relationships
for a discrimination index have been presented. Various combinations
of proportions calculated for the upper and the lower groups have
been  used.  A  correlation  between  criterion  or  total  score  and  item  is 
sometimes  used.  The  acceptable  level  of  correlation  coefficient  would 
depend  upon  the  degree  of  item  homogeneity  or  item  heterogeneity 
desired  by  the  test  constructor.  In  the  case  of  an  achievement  test, 
the  judgement  of  the  subject  matter  expert  must  always  play  an  important 
part  in  the  selection  and  rejection  of  items. 

Factor  Analysis.  It  was  shown  that  an  observed  score  can  be 
divided into two additive components, a true score and an error score.
"In factor analysis it is assumed that the true score can be further
subdivided into additive components due to various common factors and a
factor  specific  to  each  test"  (Baggaley,  1964,  p.  98).  Essentially, 

the  principal  concern  of  factor  analysis  is  the  resolution 
of  a  set  of  variables  linearly  in  terms  of  (usually)  a 
small  number  of  categories  of  "factors".  This  resolution 
can  be  accomplished  by  the  analysis  of  the  correlations 
among  factors  which  convey  all  the  essential  information 
of  the  original  set  of  variables.  Thus,  the  chief  aim  is 
to  attain  scientific  parsimony  or  economy  of  description. 

(Harman, 1960, p. 4)

Guilford  (1965)  has  shown  that  many  of  the  concepts  of  validity,  e.g., 
predictive  validity  and  multiple  correlation  principles,  are  explainable 
on  the  basis  of  factor  theory. 

The  essential  new  step  is  to  assume  that  the  variance  can  be 
further  broken  down  into  independent  additive  components  of  common 
factor  variance,  specific  variance  and  error  variance.  Communality  is 
defined  as  the  proportion  of  common  factor  variance  in  the  test  scores. 

The proportion of specific variance in a test is known as its specificity,
which is symbolized by $s^2$. Error variance is denoted by $e^2$. In
equation form, symbolizing total variance by 1.0,

$$1.0 = h^2 + s^2 + e^2$$

The specificity plus the error variance is called the uniqueness of a
test, or

$$u^2 = s^2 + e^2$$

When factors are uncorrelated, factor loadings are always the
coefficients of correlation between the respective factors and the
variables that were factored. The correlation between two tests is the
sum of the cross products of the common factor coefficients or factor
loadings. In equation form, symbolizing the correlation coefficient by
$r_{ij}$ and the factor loadings by $a_{ip}$ and $a_{jp}$,

$$r_{ij} = \sum_{p=1}^{n} a_{ip} a_{jp}$$



where $i$ and $j$ refer to tests and $p$ denotes the $p$th common factor.
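The decomposition above can be illustrated numerically. The following is a minimal sketch, assuming a small hypothetical matrix of loadings on two uncorrelated common factors; the values and the helper name `reproduced_correlation` are illustrative only and do not come from the data of this study.

```python
import numpy as np

# Hypothetical loadings of three tests (rows) on two uncorrelated
# common factors (columns); the numbers are illustrative only.
loadings = np.array([
    [0.8, 0.1],
    [0.6, 0.3],
    [0.2, 0.7],
])

# Communality h^2 of each test: sum of its squared common factor loadings.
communality = (loadings ** 2).sum(axis=1)

# Uniqueness u^2 = s^2 + e^2 = 1.0 - h^2 (total variance taken as 1.0).
uniqueness = 1.0 - communality


def reproduced_correlation(a, i, j):
    """r_ij as the sum over common factors p of a_ip * a_jp."""
    return float(np.dot(a[i], a[j]))
```

With orthogonal factors the full matrix of reproduced correlations is `loadings @ loadings.T`, whose diagonal contains the communalities rather than unities.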

A test score can be represented in a space as a point using
the co-ordinates given by the factor loadings. The same geometry also
holds for an item of a test. If an item is represented by the vector
$V_i$ of length $h_i$, and another item by the vector $V_j$ of length
$h_j$, the relationship between the vectors can be shown in the form

$$\cos\theta_{ij} = \frac{1}{h_i h_j}\, r_{ij}$$

where $\theta_{ij}$ is the angle between the two vectors.
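The geometric relationship may be checked with a short computation. This sketch assumes that each item is described by its row of loadings on orthogonal common factors, so that the vector length is the square root of the communality; the function name is hypothetical.

```python
import numpy as np


def vector_angle_cosine(a_i, a_j):
    """Cosine of the angle between two item vectors in the common factor space.

    a_i and a_j are the loadings of items i and j on orthogonal factors.
    """
    a_i = np.asarray(a_i, dtype=float)
    a_j = np.asarray(a_j, dtype=float)
    r_ij = float(a_i @ a_j)            # correlation reproduced from loadings
    h_i = float(np.linalg.norm(a_i))   # vector length, sqrt of communality
    h_j = float(np.linalg.norm(a_j))
    return r_ij / (h_i * h_j)
```

Two items that load on entirely different factors give a cosine of zero, while two items whose loadings are proportional give a cosine of one regardless of their vector lengths.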


When  several  items  have  high  loadings  on  a  single  factor,  an 
indication  is  given  of  the  internal  consistency  reliability  index  for 
these  items.  The  items  may  be  regarded  as  being  comparable  measures 
of the same hypothetical variable. If the correlations in a correlation
matrix that is to be factored are uniformly high, high factor loadings
will result, yielding a high internal consistency estimate of reliability.
Also,  when  an  item  and  a  criterion  both  have  high  loadings  on  the  same 
factor,  it  is  an  indication  of  validity  for  predicting  that  factor. 

"The  correlation  of  a  test  with  each  common  factor  (a  common  factor 
loading)  is  its  coefficient  of  validity  for  measuring  that  factor" 
(Guilford,  1954,  p.  399). 




Factor  analysis  can  be  used  as  an  aid  to  solve  the  weighting 
problem  for  many  different  items  forming  a  single  test  score.  From  an 
entire  set  of  test  items  that  have  been  factored,  a  common  factor  score 
can  be  calculated  for  each  individual  by  the  appropriate  weighting  of 
his  scores  on  the  original  variables.  The  procedure  may  be  used  to 
calculate  a  single  factor  score  derived  from  the  first,  and  also  the 
largest,  factor  or  alternatively  a  factor  score  may  be  obtained  for 
each  individual  on  every  factor. 

Where  there  is  doubt  concerning  the  psychological  homogeneity  of 
the  items  forming  a  test  or  where  the  item-total  correlation  tends  to 
be  low,  factor  analysis  may  be  used  to  divide  the  test  into  subtests, 
each  of  greater  homogeneity.  The  assumption  is  that  tests  with  high 
internal  consistency  are  desired. 

Relationship  of  Item  to  Test  Score.  Basically,  in  preparing 
a  test  one  is  concerned  with  the  problem  of  selecting  items  so  that 
the  resulting  measurement  instrument  will  have  certain  specified 
characteristics.  Flowers  (1965)  has  suggested  that  given  the  item 
means,  measures  of  item  conformities  and  measures  of  item  validities, 
it  should  be  possible  to  assemble  a  test  which  could  satisfy,  within 
certain  limits,  a  prescribed  mean,  standard  deviation,  reliability, 
validity,  skewness  and  kurtosis.  Although  Gulliksen  (1950  a)  does  not 
suggest  that  skewness  and  kurtosis  should  be  completely  ignored  since 
they  pose  many  as  yet  unsolved  problems,  he  limits  his  suggestion  to 
the  possibility  of  selecting  items  to  influence  the  test  mean,  variance, 
reliability  and  validity. 




If  transformations  of  test  scores  can  be  used  to  obtain  scores 
with  a  specified  mean,  variance,  and  within  certain  limits  the  form  of 
the  score  distribution,  it  is  suggested  that  test  construction  procedures 
may  profit  when  an  emphasis  is  placed  upon  producing  reliable  and  valid 
tests.  "Tests  composed  of  items  answered  correctly  by  about  50  percent 
of  the  group  have  a  higher  validity  than  tests  composed  of  items  that 
are  easier  or  harder  than  50  percent,  but  otherwise  of  the  same  type" 
(Gulliksen, 1950 a, p. 374). Gulliksen has shown that the formula for
calculating test validity does not show any direct relationship
between test validity and item difficulty; test validity does, however,
depend on the point-biserial item-criterion correlation.

Theoretically,  the  problem  of  selecting  a  subset  of  k  items 
from  a  total  group  of  K  items  as  well  as  the  problem  of  maximizing 
test  validity  for  predicting  any  specified  criterion  has  been  solved. 

A completely accurate solution is obtained by using the interitem
variance-covariance matrix to select the one subset of size k that has
the highest validity. The procedure is very laborious. Gulliksen (1950 a)
has reviewed several approximation procedures. If the complete interitem
variance-covariance matrix and the item-criterion covariances are
available, a maximum test validity may be obtained by solving for all
multiple correlations or for all multiple correlations using a specified
number of items.
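The exhaustive solution can be sketched as follows. This is an illustration only, assuming the interitem covariance matrix, the item-criterion covariances and the criterion variance are already available; the function `best_subset` and its argument names are hypothetical. The number of subsets grows combinatorially with the size of the item pool, which is why the procedure is described as laborious.

```python
from itertools import combinations

import numpy as np


def best_subset(item_cov, item_crit_cov, crit_var, k):
    """Search every subset of k items for the largest squared multiple correlation."""
    n_items = item_cov.shape[0]
    best_r2, best_items = -1.0, None
    for subset in combinations(range(n_items), k):
        idx = list(subset)
        c_ss = item_cov[np.ix_(idx, idx)]   # covariances among the chosen items
        c_sy = item_crit_cov[idx]           # their covariances with the criterion
        # Squared multiple correlation: c' C^{-1} c / var(criterion).
        r2 = float(c_sy @ np.linalg.solve(c_ss, c_sy)) / crit_var
        if r2 > best_r2:
            best_r2, best_items = r2, subset
    return best_items, best_r2
```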

The incompatibility of attempting to construct a test with both high
validity and high reliability, where validity is the more important
(Helmstadter, 1964; Ebel, 1965; Guilford, 1965), does not justify any



devaluation  of  high  reliability  as  a  goal  in  test  construction. 

Reliability  is  essential  to  validity.  However, 

validity  is  by  far  the  most  important  criterion  by  which 
a  test  may  be  judged,  for  an  objective,  reliable,  and  well 
standardized  instrument  can  still  be  completely  useless 
unless  the  kinds  of  inferences  which  can  legitimately  be 
made from the test score are known (Helmstadter, 1964, p. 226).


Correction  for  Attenuation.  When  two  variables  are  correlated, 
the errors of measurement, if uncorrelated among themselves, lower the
coefficient of correlation relative to that obtained from perfectly
reliable measures. It is possible but unlikely that a random change
in  score  would  make  the  correlation  larger.  McNemar  (1962,  pp.  153  -  154) 
has  derived  the  correction  for  attenuation  formula 


$$r_{tt} = \frac{r_{xy}}{\sqrt{r_{xx}}\,\sqrt{r_{yy}}}$$

where $r_{tt}$ is the correlation between perfectly reliable "true" scores
on $x$ and $y$,

$r_{xy}$ is the correlation of actual scores on $x$ and $y$,

$r_{xx}$ is the reliability of the measure of variable $x$, and

$r_{yy}$ is the reliability of the measure of variable $y$.

A  correlation  coefficient  corrected  for  attenuation  may  be 
regarded  as 

(a) the correlation between true scores in each of the two
measures and

(b) the correlation between the two measures when each is
increased to infinite length (and hence a reliability
of 1.00) (Gulliksen, 1950 a, p. 101).
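The correction for attenuation can be written as a short function. The sketch below is a direct transcription of the formula, with argument names chosen for illustration; the range check on the reliabilities is an added assumption, not part of the formula.

```python
import math


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimate the correlation between true scores on x and y.

    r_xy is the obtained correlation; r_xx and r_yy are the reliabilities
    of the two measures.
    """
    if not (0.0 < r_xx <= 1.0 and 0.0 < r_yy <= 1.0):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_xy / (math.sqrt(r_xx) * math.sqrt(r_yy))
```

For example, an obtained correlation of .40 between measures with reliabilities of .64 and .81 gives .40 / (.80 x .90), or approximately .56.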




Gulliksen  (1950  a)  maintains  that  the  "correction  for  attenuation"  is 
not  actually  a  "correction"  but  rather  is  an  estimate  of  the  correlation 
between  a  perfect  test  and  a  perfect  criterion.  Correction  for 
attenuation is actually a special case of partial correlation with the
errors $e_x$ and $e_y$ partialed out.

One practical application of the correction for attenuation is
to determine what increase in reliability of test $x$ or criterion $y$, or
both, would yield a more satisfactory value of the validity $r_{xy}$. The
equation  is  valuable  in  giving  a  quick  indication  of  the  utility  of 
attempting  to  increase  the  test  validity  by  increasing  the  test  length. 

The  correction  for  attenuation  may  thus  be  used  to  indicate  the  most 
profitable  direction  for  further  validation  research.  Another  application 
suggested  by  Gulliksen  (1950  a,  p.  214)  is  in  calculating  a  correction 
for the attenuation due to inaccuracy of reading essays. However,
while correlation coefficients corrected for attenuation are of
theoretical importance in the analysis of relationships, in that allowance
can be made for variable errors of measurement, they should not be
reported with the implication that the higher coefficient has already
been attained. Corrected r's cannot be used in prediction equations,
as prediction must necessarily be based on obtained, or fallible, rather
than true scores.

A  restriction  on  the  size  of  the  validity  coefficient  is  imposed 
by  the  reliability  of  the  criterion.  "It  is  more  important  that  the 
reliability  of  a  criterion  measure  be  known  than  that  it  be  high" 
(Thorndike,  1949,  p.  107)  since  the  following  formula 




$$r_{tt} = \frac{r_{xy}}{\sqrt{r_{yy}}}$$

may  be  used  to  provide  estimates  of  the  validity  coefficient  of  the 
fallible  tests  we  are  compelled  to  deal  with.  We  have  a  more  stable 
means  of  comparing  test  validities  if  something  is  known  about  the 
validity  of  the  criterion. 

If either $r_{xx}$ or $r_{yy}$ is underestimated, the corrected $r_{xy}$ will
be overestimated. If either reliability coefficient is overestimated,
the corrected $r_{xy}$ will be underestimated. A conservative approach would
be to underestimate the corrected $r_{xy}$. Also, the method of estimating
a  reliability  coefficient  influences  the  value  obtained.  The  question 
also  arises  as  to  which  of  the  three  main  types  of  reliability 
coefficient  is  desirable  in  correcting  for  attenuation.  "In  general, 
the alternate forms approach is probably the best" (Guilford, 1965,
p. 489).

It  has  previously  been  shown  (Gulliksen,  1950  a,  p.  382)  that 


$$r_{xy} = \frac{\sum_g s_g r_{yg}}{\sum_g s_g r_{xg}}$$


which  is  the  ratio  of  the  average  item  validity  indices  to  the  average 
reliability  indices.  Here  we  have  what  Loevinger  (1954)  has  called  the 
attenuation  paradox.  The  empirical  validity  of  a  test  is  decreased  as 
the  internal  consistency  of  the  test,  measured  by  the  item-total  test 
score  correlation,  is  increased.  An  increased  internal  consistency  may 
also  increase  the  external  criterion  correlation  but  beyond  a  certain 



point an increase in internal consistency begins to eliminate relevant
variance, thereby reducing the test-criterion correlation.
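The ratio underlying the attenuation paradox can be illustrated with a small computation. This is a sketch only, assuming that $s_g$ denotes the standard deviation of item $g$, $r_{yg}$ its correlation with the criterion and $r_{xg}$ its correlation with the total test score; the function name and the numerical values are hypothetical.

```python
import numpy as np


def validity_from_item_indices(item_sd, r_item_criterion, r_item_total):
    """Test validity as the ratio of summed item validity indices (s_g * r_yg)
    to summed item reliability indices (s_g * r_xg)."""
    s = np.asarray(item_sd, dtype=float)
    r_yg = np.asarray(r_item_criterion, dtype=float)
    r_xg = np.asarray(r_item_total, dtype=float)
    return float(np.sum(s * r_yg) / np.sum(s * r_xg))


# Same item-criterion correlations, but higher item-total correlations
# in the second case (illustrative values only).
low_consistency = validity_from_item_indices([0.5, 0.5], [0.2, 0.3], [0.5, 0.5])
high_consistency = validity_from_item_indices([0.5, 0.5], [0.2, 0.3], [0.8, 0.8])
```

When the item validity indices are held fixed, raising the item-total correlations enlarges only the denominator, so the computed test validity falls. Holding the validity indices fixed is a simplification made here for illustration.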

Each variable in a factor analysis is commonly treated as though
it contains three components: common, specific and error variance. The
variance components are illustrated by

$$1.0 = h^2 + s^2 + e^2$$

where $h^2$ is the common variance (communality), $s^2$ is the specific
variance and $e^2$ is the error variance. Essentially, the unique
variance which is not common to the other variables is removed through
estimation of communalities before the analysis is begun, or by
selecting a small number of common factors after the analysis has been
completed. The reproduced correlations are then attributable to only
common factor variance.


CHAPTER  V 


REVIEW  OF  ITEM  SELECTION  PROCEDURES 

Exact  procedures  have  been  developed  for  selecting  items  to  form 
a  test  by  using  complete  regression  systems.  Some  procedures  allow 
positive and negative weights for each item whereas other methods designate
selection or rejection of an item by a weight of one or zero.

Because  such  methods  were  regarded  as  too  laborious  computationally  for 
practical  purposes,  several  approximation  techniques  have  been  devised. 

Weighting 

When  a  single  score  is  to  be  derived  from  a  weighted  sum  of  items, 
one  is  faced  with  the  problem  of  determining  the  appropriate  method  of 
combining  these  scores.  One  solution  is  to  select  the  items  for  a  test 
and  then  use  multiple  correlation  procedures  to  determine  the  optimal 
weighting  system  for  each  item  in  the  test  to  predict  a  selected  criterion. 
An  alternative  approach  is  to  select  items  by  using  a  step-wise  regression 
solution.  Theoretically  the  method  of  using  weights  is  the  most  suitable 
for  accurate  prediction  but  it  tends  to  increase  to  a  considerable 
extent  the  time  and  effort  involved  in  determining  an  individual’s  score. 
Gulliksen  suggests  that  some  approximation  to  multiple  correlation  is  to 
be  preferred  to  the  exact  method,  when  selection  is  to  be  made  from 
many  variables,  since  "for  practical  purposes,  simple  integral 
approximations to the exact multiple weights will usually give a satisfactory
composite score" (Gulliksen, 1950 a, p. 356). Douglas and
Spencer  (1923)  concluded  that  it  made  very  little  difference  in  the 




ultimate  outcome  as  to  what  weights  are  assigned  to  each  measure.  They 
found,  for  a  number  of  tests,  that  scores  obtained  with  unit  weights 
correlated .98 to .99 with the same scores obtained through use of
optimal  item  weights.  It  may  therefore  be  concluded  that  using 
fractional  weights  rather  than  integral  weights  for  different  items  in 
a  typical  test  will  not  prove  significantly  more  valuable  in  arriving 
at  a  total  test  score.  "The  gain  in  predictive  efficiency  achieved  by 
the  use  of  ultra-refined  techniques  of  item  analysis  in  preference  to 
relatively  crude  methods  would  appear  to  be  nominal  at  best"  (Rozeboom, 
1966,  p.  519). 
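The small practical difference between unit weights and optimal weights can be illustrated with simulated data. The sketch below is not a reanalysis of any study cited above; it assumes ten items that each measure one common factor with equal error, a criterion measuring the same factor, and least squares weights as the "optimal" weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_items = 2000, 10

# Simulated data: every item and the criterion share one common factor.
factor = rng.standard_normal(n_people)
items = factor[:, None] + rng.standard_normal((n_people, n_items))
criterion = factor + rng.standard_normal(n_people)

# "Optimal" weights: least squares regression of the criterion on the items.
design = np.column_stack([np.ones(n_people), items])
coef, *_ = np.linalg.lstsq(design, criterion, rcond=None)
optimal_score = design @ coef

# Unit weights: the simple sum of the item scores.
unit_score = items.sum(axis=1)

# Correlation between the two composite scores.
agreement = np.corrcoef(unit_score, optimal_score)[0, 1]
```

Under these assumptions the two composites are very highly correlated. The agreement would be expected to fall when the items differ markedly in validity or are nearly uncorrelated with one another.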

Regression  Procedures 

Various procedures have been reported which enable a test constructor
to maximize test validity by selecting individual items from
a pool of items. If a criterion is available, and we desire to weight
the items in such a manner that the composite score will have the highest
possible correlation with the criterion, the method of multiple correlation
is the one to use (Gulliksen, 1950 a). The above procedure has
been extended by Horst (1961) to include a set of $n_1$ predictor variables
and a set of $n_2$ criterion variables. If the $n_1 + n_2$ variables for the
same  individuals  are  available,  a  linear  combination  of  the  predictor 
variables  and  a  linear  combination  of  the  criterion  variables  can  be 
calculated  which  will  yield  the  highest  possible  correlation  between  the 
composites.  Gulliksen  maintains  that  multiple  correlation  methods  give 
the  best  weights  for  predicting  the  criterion  but  "simple  integral 
approximations  to  these  weights  will  usually  give  a  composite  score 




that  correlates  almost  as  well  with  the  criterion"  (1950  a,  p.  330). 

The  above  procedures  involve  finding  the  "best"  weighting  system  for 
a  given  set  of  items  whereas  a  step-wise  regression  procedure  allows 
one to select an item at a time which will result in an ordered selection
of the subset of k items that best predicts the criterion. However,
"the precise method of weighting is not important unless we are
dealing  with  relatively  few  tests  that  are  not  highly  correlated" 
(Gulliksen,  1950  a,  p.  327). 
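A step-wise selection of the kind described above might be sketched as follows. This is a simplified forward selection that adds, at each step, the item giving the largest squared multiple correlation with the criterion; it is not any particular published algorithm, and the function names are hypothetical.

```python
import numpy as np


def multiple_r2(x, y):
    """Squared multiple correlation of y with the columns of x (with intercept)."""
    design = np.column_stack([np.ones(len(y)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    return 1.0 - residual.var() / y.var()


def forward_select(items, criterion, k):
    """Return the indices of k items in the order in which they were selected."""
    chosen = []
    remaining = list(range(items.shape[1]))
    for _ in range(k):
        # Try each remaining item alongside those already chosen.
        best = max(
            remaining,
            key=lambda j: multiple_r2(items[:, chosen + [j]], criterion),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

Because each step conditions on the items already chosen, the ordered subset is not guaranteed to coincide with the subset that an exhaustive search would select.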

The  procedure  for  predicting  an  external  criterion  by  multiple 
correlation  is  outlined  by  Gulliksen  (1950  a)  in  Theory  of  Mental 
Tests. Several methods of selecting items for a test by approximations
to  multiple  correlation  have  been  published. 

Approximation  methods  to  multiple  correlation  have  been  developed 
which  are  used  to  assemble  a  collection  of  items  whose  composite  score 
would  have  maximum  validity.  Horst  (1936)  proposed  a  method  which  takes 
into account the intercorrelations of the items as well as their correlations
with the criterion. Other closely similar procedures have been
described by Richardson and Adkins (1938), Toops (1941), Wherry and
Gaylord (1946), Gleser and DuBois (1951), Horst (1956), and Horst and
MacEwan (1956, 1957). Lubin and Osburn (1957) reported a technique of
pattern  scoring  of  test  items  for  the  prediction  of  a  quantitative 
criterion.  Osburn  and  Lubin  (1957)  have  worked  with  a  method  whereby 
test  scoring  techniques  can  be  evaluated  to  see  if  they  have  maximum 
validity.  A  less  laborious,  though  analogous,  procedure  than  that  of 
Gleser  and  DuBois  (1951)  has  been  developed  by  Webster  (1956).  Webster's 
non-parametric method will yield dependable results for dichotomized




items  when  N  (observations)  is  large. 

Canonical  Correlation 

A  statistical  procedure,  seldom  mentioned  in  references  dealing 
with  test  theory  and  item  selection,  known  as  canonical  analysis  may 
be  an  appropriate  technique  that  should  be  applied  to  the  general  area 
of item selection. Canonical analysis is another approach to multivariate
analysis. In canonical analysis the linear combination of the
dependent variables which is most predictable from the best linear
combination of the independent variables is found (Cooley and Lohnes,
1962).
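The canonical correlations between a set of items and a set of criteria can be computed with a few matrix operations. The sketch below uses a standard numerical route (orthonormal bases of the centred score matrices followed by a singular value decomposition) and assumes both score matrices have full column rank; it returns only the correlations, not the weights, and the function name is hypothetical.

```python
import numpy as np


def canonical_correlations(x, y):
    """Canonical correlations between the columns of x and the columns of y.

    x and y are (people by variables) score matrices for the same people.
    """
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    # Orthonormal bases for the column spaces of the centred matrices.
    qx, _ = np.linalg.qr(x)
    qy, _ = np.linalg.qr(y)
    # Singular values of qx' qy are the canonical correlations,
    # returned in descending order.
    return np.linalg.svd(qx.T @ qy, compute_uv=False)
```

The first value is the largest correlation attainable between any linear combination of the items and any linear combination of the criteria.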

Before  the  advent  of  large  and  fast  computers,  canonical  analysis 
was  far  too  time  consuming  and  involved  to  be  practical.  However, 
present  facilities  are  available  to  handle  the  tremendous  number  of 
calculations  involved.  Canonical  correlation  techniques  could  be  used 
to  find  regression  weights  for  the  items  and  criterion  variables.  Although 
items  with  low  regression  weights  can,  in  subsequent  analyses,  be  omitted 
from a test, which is in a sense a means of selecting items, the procedure
is not in fact well suited for the problem of selecting items.

The  merit  lies  in  weighting  the  variables  after  the  selection  of  items 
has  been  completed.  However,  when  many  items  are  used  to  construct  a 
test,  the  particular  weighting  system  is  not  too  important.  An 
important  aspect  of  using  canonical  correlation  techniques  is  that  a 
multi-dimensional  criteria  space  is  considered  in  selecting  weights  for 
the  predictors.  However,  while  the  linear  weighting  system  applied  to 
the  items  and  criteria  may  tell  us  something  about  the  items,  in  many 
situations  the  user  would  like  to  specify  his  own  linear  combination 




of criteria.

Factor  Analysis 

Factors  can  be  conceived  of  as  the  principles  of  classification 
or dimensions that allow the test constructor to reconstruct the properties
of the material being considered rather than relying on subjective
preference, intuition, or common sense (Eysenck, 1966). Factor analysis
"enjoys its greatest justification as an exploratory technique, by which
the variables under consideration all enjoy a reasonably well-rationalized,
but not necessarily certain, probability of belonging to the
scientific domain of interest, and the structure of which is essentially
unknown" (Kaiser, 1966, p. 361).

Items  may  be  sampled  after  a  simple  structure  factor  loading 
matrix  has  been  calculated.  The  items  having  the  highest  loadings  in 
each  factor  are  defined  as  the  best  measures  of  that  factor.  Several 
tests  can  be  formed.  Each  test  or  subtest  is  constructed  by  including 
those items with the highest loadings in each factor. Since the constructed
tests will be mutually orthogonal, the common procedure is
to  form  a  battery  of  tests  (Horst,  1965,  1966). 
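The sampling of items from a simple structure loading matrix can be sketched as follows. The function name, the number of items per subtest and the example loadings are hypothetical; an item is taken to measure a factor well when its signed loading on that factor is among the largest.

```python
import numpy as np


def items_by_factor(loadings, per_factor):
    """For each factor, return the indices of the items with the highest loadings."""
    loadings = np.asarray(loadings, dtype=float)
    subtests = {}
    for p in range(loadings.shape[1]):
        # Sort the items on factor p from highest to lowest loading.
        order = np.argsort(-loadings[:, p])
        subtests[p] = sorted(int(i) for i in order[:per_factor])
    return subtests


example_loadings = [
    [0.8, 0.1],
    [0.7, 0.2],
    [0.1, 0.9],
    [0.2, 0.6],
    [0.3, 0.3],
]
example_subtests = items_by_factor(example_loadings, per_factor=2)
```

In this example the first two items form the subtest for the first factor and the third and fourth items form the subtest for the second, while the fifth item, which loads weakly on both factors, is not used.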

Scale  Analysis 

In  references  relating  measurement  to  scale  analysis  (Lingoes, 
1963;  Horst,  1965),  Guttman’s  name  is  frequently  mentioned.  Guttman’s 
(1955)  concept  of  a  perfectly  scalable  set  of  items  was  based  upon  the 
notion  that  all  persons  marking  an  item  with  a  given  preference  value 
would  also  mark  all  items  of  greater  preference  value.  Thus,  for  any 
particular  set  of  item  difficulties,  a  person  getting  a  more  difficult 




item correct would also get all the easier items correct. The resultant
matrix has been called a perfect simplex. If a perfectly homogeneous
set of items with resulting perfect retest reliability is available,
one item is of as much value for measurement purposes as the entire
set of items. However, as a result of varying item difficulties and
measurement errors, the ideal test item is not available.
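Whether a set of dichotomous responses forms a perfect scale in the sense described above can be checked by counting departures from the ideal response patterns. The sketch below uses one common convention, comparing each person's responses with the ideal pattern for the same total score; other error-counting conventions exist, and the function name is hypothetical.

```python
import numpy as np


def guttman_errors(responses):
    """Count deviations from a perfect scale in a people-by-items 0/1 matrix."""
    responses = np.asarray(responses, dtype=int)
    # Order the items from easiest to hardest by proportion correct.
    order = np.argsort(-responses.mean(axis=0), kind="stable")
    ordered = responses[:, order]
    errors = 0
    for row in ordered:
        total = int(row.sum())
        # Ideal pattern: the `total` easiest items correct, the rest incorrect.
        ideal = np.array([1] * total + [0] * (len(row) - total))
        errors += int(np.sum(row != ideal))
    return errors
```

A count of zero indicates that every person who answered a more difficult item correctly also answered all easier items correctly.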

The  concept  of  a  "universe  of  content"  in  relation  to  item 
construction  and  selection  in  Guttman's  scalogram  method  has  recently 
been  extended  by  Lingoes  (1963)  who  presented  a  completely  objective 
and  empirical  procedure  for  selecting  dichotomous  items  which  meet  the 
Guttman  scaling  criteria  in  multiple  dimension  situations.  Lingoes' 
method  involves 

selecting  an  item  from  the  set  to  be  analysed,  finding  that 

item  among  the  remaining  items  which  is  most  like  it  and  having 

the  fewest  errors,  determining  the  number  of  errors  between 

the  candidate  item  and  all  of  its  predecessors,  and  finally, 

applying  a  statistical  test  of  significance  to  adjacent  item 

pairs.  ...  All  items  are  forced  into  a  positive  manifold 

and  monotonicity  of  item  marginals  is  insisted  upon  (1963,  p.  502). 

No  references  or  examples  concerned  with  the  above  procedure 
have been found by the writer in a review of test construction procedures.
The technique appears to offer a new approach to selecting
items, but it will have to be tested empirically prior to any conclusive
decision regarding its usefulness.




CHAPTER  VI 


THEORETICAL  DEVELOPMENT  OF  THE  SELECTION  TECHNIQUE 

For  selection  purposes  the  pool  of  items  may  be  considered  to 
be responses made by examinees to the items. With knowledge of the
response  to  an  item  and  the  associated  correct  response,  it  is  possible 
to  calculate  representative  summary  data  for  the  items.  A  common  analytic 
representation  of  the  relationships  between  items  is  given  by  a 
correlation  matrix  and  the  corresponding  array  of  item  means. 
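The summary data described above can be obtained directly from the raw responses. The following sketch assumes the responses are recorded as one row per examinee and one column per item, with a key giving the correct response to each item; the function and argument names are hypothetical.

```python
import numpy as np


def item_summary(responses, key):
    """Score responses against the key and summarize the items.

    Returns the 0/1 scored matrix, the item means (proportions correct)
    and the inter-item correlation matrix.
    """
    responses = np.asarray(responses)
    key = np.asarray(key)
    # The key is compared with every examinee's row of responses.
    scored = (responses == key).astype(float)
    item_means = scored.mean(axis=0)
    item_corr = np.corrcoef(scored, rowvar=False)
    return scored, item_means, item_corr
```

An item answered identically by every examinee has zero variance, and its correlations are then undefined; such items would have to be removed before the matrix is factored.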

Although  information  about  the  relationships  among  the  variables 
in a multiple set is summarized by means of a correlation matrix calculated
from multiple measures, the problem of interpretation and summarization
is encountered. It would be difficult for a person to relate
fully  all  variables  and  subsequently  provide  an  interpretation  of  all  the 
relationships.  Part  of  the  difficulty  arises  because  of  the  over-lapping 
nature  of  the  variables. 

Factor analysis can be used to approximate the original relationships
among variables in terms of a smaller number of basic constructs
called factors. The dimensionality of the factor matrix will be, in
most cases, much smaller than the rank of the correlation matrix. Concern
here is with the minimum number of factors and not with their
interpretation. 

The  problem  of  interpretation  can  best  be  handled  by  the  use  of 
Thurstone's  (1947)  principle  of  simple  structure.  If  the  simple 
structure  criterion  is  used  for  finding  the  factor  loading  matrix 
corresponding  to  a  correlation  matrix,  it  should  be  relatively  easy  to 




identify  meaningfully  the  factors.  Since  interpretation  rests  heavily 
upon  a  pure  definition  of  items  and  criteria,  the  problem  of  meaning¬ 
fulness  is  the  responsibility  of  the  user.  However,  the  problem  of 
defining  an  analytic  criterion  for  selecting  items  is  independent  of  the 
factor  interpretation  problem.  A  solution,  based  upon  psychometric 
principles,  is  outlined  below. 

Factor Analysis and Prediction

"Many methods are available for predictor selection, but in general,
these have not used the factor analytic techniques and are deficient
in that they capitalize on chance error" (Horst, 1965, p. 22). Only
recently, with the advent of electronic computers, have factor analytic
techniques been applied to the problem of predictor selection. Although
it has been claimed that multiple regression provides the best
theoretical answer to the problem of developing a test to predict a
single criterion (Gulliksen, 1950a), Horst maintains that

it  is  of  some  interest,  however,  to  see  not  only  how  the 
classical  methods  of  least  square  multiple  recti-linear 
prediction  can  be  brought  into  the  general  framework  of 
factor  analysis,  but  also  how  these  classical  methods  can 
be  modified,  and  perhaps  even  improved  by  formulating  the 
problems  in  terms  of  the  models  and  objectives  of  factor 
analysis  (Horst,  1965,  p.  540). 

A  distinct  advantage  of  factor  analytic  techniques  is  that  estimates 
of  error  variance  may  be  introduced  into  the  solution. 

Proposed  Selection  Technique 

The item-selection method proposed here begins with a principal
axis  factor  analysis  of  the  matrix  of  intercorrelations  calculated 
from predictor and criterion raw score data. One solution to the problem
of deciding how many orthogonal factors to extract has been proposed
by  Kaiser  (1960)  who  recommends  using  only  those  factors  with  corres¬ 
ponding  eigenvalues  greater  than  one  to  define  the  common  factor  space. 

The rule applies only if unities are used in the diagonal of the
correlation matrix. If $h_j^2$ represents the communality of variable $j$,
the remaining $1.00 - h_j^2$ variance of item $j$ is the unique component
or residual variance. The resulting factor solution can be used to
calculate $\hat{R}$, a reproduced correlation matrix, that is, an
approximation of the original correlation matrix $R$. A measure of the lack
of fit between the obtained factor model for the domain and the observed
relations among the variables is given by the difference between $\hat{R}$
and $R$. It is assumed that the true variance is free from error or
random variation. A simple structure transformation of the principal
axis factor loading matrix is often required for purposes of
interpretation. A major objective in using factor analysis is to be able
to identify and eliminate as much unsystematic variance as possible.
Secondly, it is highly desirable to have projections of item variables
on orthogonal factors to represent the idealized dimensions. A
convenient rotational procedure to achieve a simple structure approximation
is recommended by the writer prior to using the proposed selection
method. A previous and still frequently used application of factor
analysis is to select subsets of variables such that each of the simple
structure factors will be adequately represented by the subset. This is
essentially the common procedure in selecting tests to form a test
battery.
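
The extraction step described above can be illustrated with a minimal
sketch. It assumes a principal axis solution obtained from the
eigendecomposition of the correlation matrix with unities in the diagonal,
and it retains factors by Kaiser's eigenvalue-greater-than-one rule. The
function name `principal_axis_kaiser` and the small matrix `R_demo` are
hypothetical and serve only as an illustration; they are not taken from
the original study.

```python
import numpy as np

def principal_axis_kaiser(R):
    """Principal axis loadings of R, keeping factors with eigenvalue > 1."""
    R = np.asarray(R, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(R)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]           # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1.0                        # Kaiser's criterion
    loadings = eigvecs[:, keep] * np.sqrt(eigvals[keep])
    communalities = np.sum(loadings ** 2, axis=1)   # h_j^2 for each variable
    R_hat = loadings @ loadings.T               # reproduced correlation matrix
    return loadings, communalities, R_hat

# Hypothetical correlation matrix for four variables (illustration only).
R_demo = np.array([[1.0, 0.6, 0.1, 0.1],
                   [0.6, 1.0, 0.1, 0.1],
                   [0.1, 0.1, 1.0, 0.5],
                   [0.1, 0.1, 0.5, 1.0]])

loadings_demo, h2_demo, R_hat_demo = principal_axis_kaiser(R_demo)
unique_demo = 1.0 - h2_demo                     # unique (residual) variance
residual_demo = R_demo - R_hat_demo             # lack of fit of the factor model
```

The diagonal of the reproduced matrix equals the communalities, so the
diagonal of the residual matrix is the unique variance of each variable.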

Consideration must also be made of criterion measures required to
establish psychological meaningfulness for the test under construction.

It  is  argued  that  even  though  ultimate  criteria  may  not  be  readily 
available,  the  test  constructor  must  have  some  notion  of  what  relatively 
unitary  skills,  aptitudes  or  abilities  are  required.  An  investigator 
must  locate  measures  which,  through  the  use  of  a  linear  weighting  system 
can  be  made  to  approximate  the  ultimate  criterion.  When  dealing  with 
several  criteria,  attempts  are  generally  made  to  combine  them  into  a 
single  criterion  measure  or  to  use  each  as  a  single  criterion  score. 

The  use  of  several  independent  criteria  provides  a  simplified  solution 
but  yields  less  information  than  a  composite  criterion.  As  in  the  case 
of  predictors,  it  is  possible  to  identify  the  common  factor  variance 
of  the  separate  criteria  by  means  of  factor  analysis.  Those  criteria 
with  high  factor  loadings  could  be  selected  to  represent  the  best 
measures  of  the  factors  in  the  criterion  space.  The  use  of  basic  stat¬ 
istics  to  determine  which  criterion  to  use  does  not,  however,  deal 
directly  with  the  fundamental  problem  of  relevance.  Some  decision  must 
be  made  by  the  investigator,  or  group  of  "experts",  as  to  the  relevance 
of  each  criterion  to  the  ultimate  criterion. 

When  questions  of  concurrent  and  predictive  validity  arise,  the 
first  concern  is  with  finding  a  suitable  criterion.  The  establishment 
of  content  validity  requires  no  measurable  criterion  since  to  assign 
content  validity  to  any  test,  is,  in  essence,  to  compare  idealized  course 
content  with  examination  content.  A  general  impression  formed  by  the 
writer after reading references concerned with various aspects of validity,
e.g., Technical Recommendations (1954), Cronbach (1960) and Anastasi (1961),
is that little, if any, concern is given to the notion of predictive test
validity until after a test has been constructed. This is not to say
that there is no thought given to defining the end product in terms of
some set of criteria. However, the implicit underlying set of rules
used to develop a test should first be made explicit and formally stated.
This is done at times in constructing items when "objectives" and "content"
areas are used to specify types of required items. Important preliminaries
to the construction of any test should be a specification of ultimate
criteria and some attempt at criterion development.

The  proposed  item  selection  technique  is  applicable  to  the  problem 
of  selecting  items  where  the  item  variables  can  be  described  in  terms  of 
a  known  factor  structure.  The  classical  situation  where  the  test  score 
is  a  linear  function  of  the  item  responses  will  be  considered.  An 
item  response  is  to  be  used  as  a  predictor  of  a  criterion  variable. 
Sampling  of  questions  is  not  considered.  Selection  of  items  is  res¬ 
tricted  to  a  design  problem.  The  proposed  item  selection  approach  is 
based  upon  the  assumption  that  the  items  and  criteria  have  a  known 
factor  structure  with  a  comparatively  small  number  of  common  factors. 

This  is  a  departure  from  other  selection  models  in  that  a  reduced  matrix 
of  factor  coefficients  is  used  as  a  starting  point.  The  rank  of  the 
reproduced  correlation  matrix  will  be  considerably  smaller  than  the 
order  of  the  original  correlation  matrix. 

General  Description  of  the  Selection  Procedure 

The  factor  analytic  procedure  used  by  the  writer  for  obtaining  a 
factor  matrix  from  the  correlation  matrix,  composed  of  items  and  criteria, 
was  a  principal  axis  factor  analysis  using  the  Householder  method  (1938) 
with unities in the diagonal of the correlation matrix. A primary concern
at this stage was to establish the number of significant orthogonal
factors,  according  to  Kaiser's  criterion  (1960)  of  retaining  those 
factors  with  eigenvalues  greater  than  one.  Kaiser's  (1958)  varimax 
criterion  was  applied  to  rotate  the  principal  axis  factor  matrix  to 
simple structure. The procedure outlined above has been used commonly
in  the  application  of  factor  analytic  techniques  to  the  isolation  and 
identification  of  a  limited  number  of  hypothetical  variables  under¬ 
lying  a  group  of  observed  variables. 
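
As a hedged illustration of the rotation step, the sketch below applies a
widely used SVD-based iteration for the raw varimax criterion. It is not
the program used by the writer, it omits Kaiser's row normalization, and
the names `varimax` and `L_demo` are hypothetical.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Rotate a loading matrix toward the raw varimax criterion.

    Returns the rotated loadings and the orthonormal rotation matrix.
    """
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    rotation = np.eye(k)
    d_old = 0.0
    for _ in range(max_iter):
        Lr = L @ rotation
        # Gradient-like target matrix for the varimax criterion.
        target = Lr ** 3 - Lr @ np.diag(np.sum(Lr ** 2, axis=0)) / p
        u, s, vt = np.linalg.svd(L.T @ target)
        rotation = u @ vt                      # nearest orthonormal matrix
        d_new = s.sum()
        if d_old != 0.0 and d_new / d_old < 1.0 + tol:
            break                              # criterion has stopped improving
        d_old = d_new
    return L @ rotation, rotation

# Hypothetical unrotated loadings for four variables on two factors.
L_demo = np.array([[0.7, 0.5],
                   [0.6, 0.6],
                   [0.5, -0.5],
                   [0.6, -0.4]])
rotated_demo, rot_demo = varimax(L_demo)
```

Because the rotation matrix is orthonormal, the communality of each
variable is unchanged by the rotation; only the distribution of variance
across the factors is altered.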

A transformation matrix was used to rotate the factor matrix such
that the hypothetical goal test vector, specified by the test constructor,
and the first factor were collinear. A geometric representation of a
three-dimensional orthogonal factor space is presented in Figure 1.
Factors represented by axes I, II and III give the geometric basis for
determining the location of the goal vector (GV). By assigning relative
weights, $x_i$, to each factor, the location of GV in the item and criterion
space is specified. The axes are each rotated $\theta^\circ$, as illustrated, to
position factor I collinear with GV. Axes I', II' and III' are now
the frame of reference for the orthogonal factor space. The loadings
on factor one then represent the correlation of each item, as well as
the criteria, with the goal test. With the direct relationship of an
item to the goal test specified, selection of items to construct a
test is initiated.
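
The geometric idea can be sketched numerically. Assuming orthogonal
factors, the loading of an item on a first factor placed collinear with GV
is simply the projection of the item's loading vector on the normalized
weight vector. The function `goal_loadings` and the arrays below are
hypothetical and are given for illustration only.

```python
import numpy as np

def goal_loadings(A, weights):
    """Project each row of the loading matrix A on the unit goal vector."""
    A = np.asarray(A, dtype=float)
    t = np.asarray(weights, dtype=float)
    t = t / np.linalg.norm(t)          # normalize the relative weights
    return A @ t                       # correlation of each row with the goal test

# Hypothetical loadings of three items on three orthogonal factors.
A_goal_demo = np.array([[0.5, 0.5, 0.0],
                        [0.5, -0.5, 0.3],
                        [0.0, 0.0, 0.9]])
# Hypothetical relative weights: factors I and II equally, factor III not at all.
r_goal_demo = goal_loadings(A_goal_demo, [1.0, 1.0, 0.0])
```

In this illustration the first item lies along GV and keeps its full
length as its loading, whereas the other two items are orthogonal to GV
and obtain loadings of zero.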

Figure 1. Position of the Goal Test Vector in the Common Factor Space

Figure 2. Relative Positions of Item Vectors in the Common Factor Space

In Figure 2 the relative positions of five items in the factor
space, the loadings on GV and the three axes are illustrated. Each item
is numbered according to the order of selection in constructing a test.
Items 4 and 5 would not be selected because they do not meet certain
specifications that will be explained later. The first item selected
would have the highest loading on factor one. In sequence, the second
item selected would load second highest on the first factor. From
knowledge of the locations in the factor space, a centroid or composite
vector would be formed that was composed of the first two selected items.
The  first  two  items  selected  would,  in  most  cases,  have  high  loadings 
on  factor  one  which  would  account  for  the  greatest  proportion  of  each 
item's  common  factor  variance.  If  additional  items  were  to  be  selected 
in  this  manner,  a  problem  is  immediately  manifested.  When  the  first 
factor  loading  of  an  item  is  low,  it  is  possible  that  one  of  the 
second  through  m  extracted  factors  would  have  a  higher  loading  and  the 
item  would  not  represent  the  intended  relationship  to  the  goal  vector. 

A  more  general  consideration  would  be  to  determine  the  proportion  of 
variance  accounted  for  by  factor  one,  denoted  as  "true"  variance,  as 
compared  to  the  common  factor  variance  in  the  remaining  m  -  1  factors, 
here  referred  to  as  "error"  variance,  for  each  item.  If  an  item  is  to 
make  a  significant  contribution  in  the  construction  of  a  test,  more 
"true"  variance  than  "error"  variance  should  be  contributed.  The  "true" 
variance  represents  characteristic  item  properties  desired  by  the  test 
constructor  whereas  the  "error"  variance  is  undesirable.  Thus, 
consideration  is  given  to  selecting  only  those  items  whose  "true"  variance 
is  greater  than  the  "error"  variance.  That  portion  considered  to  be 
"error"  variance  in  one  application  of  the  selection  procedure  may 
represent  variance  components  of  other  tests  orthogonal  to  the  goal  test. 

The  angular  departure  and  the  correlation  relationship  of  an 
item to the goal test vector should be considered. When an item vector
deviates more than 45° from the goal vector, the variance contributed
to  the  goal  vector  will  be  less  than  that  to  tests  orthogonal  to  the 
goal  vector.  Thus,  the  item  would  not  be  considered  for  selection.  An 
item  correlation  of  less  than  .300  with  the  goal  vector  would  not  be 
considered  significant  because  the  additional  variance  contributed  by 
this  item  would  not  appreciably  add  to  the  specification  of  the  goal 
test.  A  final  item  characteristic  should  be  that  the  loading  of  an 
item  on  factor  one  be  greater  than  or  equal  to  .300  in  order  to  be 
considered  worthy  of  selection.  For  the  reasons  given  above,  items 
4  and  5  in  Figure  2  would  not  be  selected. 
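
The two screening rules suggested above (an angular departure of no more
than 45° and a loading on factor one of at least .300) can be sketched as
follows. The function `passes_screen`, its parameter names and the array
`B_screen_demo` are hypothetical; the first column of the array is assumed
to hold the loadings on the goal test.

```python
import numpy as np

def passes_screen(B, min_loading=0.300, max_angle_deg=45.0):
    """Flag rows of B that meet the minimum loading and maximum angle rules."""
    B = np.asarray(B, dtype=float)
    b1 = B[:, 0]                                   # loading on the goal test
    h = np.linalg.norm(B, axis=1)                  # length of each item vector
    cos_theta = np.divide(b1, h, out=np.zeros_like(b1), where=h > 0)
    angle = np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))
    return (b1 >= min_loading) & (angle <= max_angle_deg)

# Hypothetical rotated loadings for three items on three factors.
B_screen_demo = np.array([[0.80, 0.20, 0.10],    # close to the goal vector
                          [0.40, 0.50, 0.30],    # more than 45 degrees away
                          [0.25, 0.05, 0.05]])   # loading below .300
keep_demo = passes_screen(B_screen_demo)
```

Both thresholds are left as parameters because they are offered in the
text as suggestions that the test constructor may vary.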

After consideration has been given to the various restrictions
to be imposed prior to selecting an item, the selection of items continues
until the desired number of items has been selected, until there are no
items remaining in the pool of items that meet the imposed conditions,
or until the test being constructed deviates in composition beyond the
prescribed tolerance limits.

The  procedures  suggested  by  the  writer  in  the  above  presentation 
are  not  intended  to  provide  restrictions  upon  the  use  of  the  factor 
analytic  item-selection  algorithm.  Many  variations  would  be  possible  if 
various  types  of  factor  analysis  such  as  the  square  root  procedure,  the 
maximum  likelihood  solution  or  the  alpha  factor  analytic  method  replaced 
the  principal  axis  solution.  In  addition,  the  equimax  or  quartimax 
criterion  could  be  used  in  preference  to  the  varimax  criterion  example. 
Several  parameters  were  presented  above  to  suggest  a  maximum  acceptable 
angular  displacement  and  a  minimum  significant  correlation  coefficient. 
These values are, in the writer's opinion, merely plausible suggestions
and may thus be varied to meet the individual test constructor's
specifications. It is intended that each test constructor will, with
minimal  effort,  be  able  to  modify  the  test  parameters  to  best 
represent  the  desired  characteristics  in  the  constructed  test. 


Mathematical  Description  of  the  Selection  Procedure 

Mathematically the procedure for selecting items can be stated
as follows.

Let

$$
A = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1m} \\
a_{21} & a_{22} & \cdots & a_{2m} \\
\vdots & \vdots &        & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nm} \\
c_{11} & c_{12} & \cdots & c_{1m} \\
\vdots & \vdots &        & \vdots \\
c_{k1} & c_{k2} & \cdots & c_{km}
\end{bmatrix}
$$

where the elements $a_{ij}$ are the factor coefficients of $n$ predictor
variables (items in our case) on $m$ orthogonal factors and $c_{ij}$ are the
factor loadings for the $k$ criteria on the $m$ orthogonal factors.

Since the $m$ factors of the space span the space and are, from
the user's point of view, psychologically meaningful, the object test
taken as a linear combination of the $m$ factors will also be contained
within the space $A_m$. That is, the object test will be contained within
the same space as the common parts of both item and criterion factors.
The exact linear combination of the $m$ factors required to define the
object test vector may be specified as

$$
X = \begin{bmatrix} x_1 & x_2 & \cdots & x_m \end{bmatrix}
$$

which has as elements the relative weighting system to be applied to
the $m$ orthogonal factors.

In order to determine the perpendicular projections of each item
vector upon the object test vector, it is desirable to place any one of
the $m$ orthogonal factors collinear with $X$. Arbitrarily, the first
vector may be selected and positioned by defining an appropriate
transformation matrix $T$ to be applied to the matrix $A$.

Specifying the normalized vector $X$ as $t$ and appending it to the
matrix $A$ such that

$$
\begin{bmatrix} A \\ t \end{bmatrix}
$$

it is required that the transformation matrix be such that

(a) $A T = S$, where $r_{s_1 t} = 1$ and $r_{s_r t} = 0$ for
    $r = 2, 3, \ldots, m$ $(r \neq 1)$, and

(b) $T T' = I$, that is, $T$ is an orthonormal transformation
    matrix performing an orthogonal transformation on $A$.

Such a transformation matrix may be generated in a number of ways,
perhaps the simplest being as follows.

Let

$$
X^{*} = \begin{bmatrix}
x_1     & x_2 & \cdots & x_m     \\
x_2     & x_3 & \cdots & x_1     \\
\vdots  & \vdots &     & \vdots  \\
x_{m-1} & x_m & \cdots & x_{m-2} \\
x_m     & x_1 & \cdots & x_{m-1}
\end{bmatrix}
$$

and apply the Gram-Schmidt orthonormal process (Hohn, 1964, pp. 264-267)
to $X^{*}$, starting with the column vector $X_1$ in forming

$$
T_1 = \frac{X_1}{\lVert X_1 \rVert}
$$

and the $r$th column vector of $T$ is given by

$$
T_r = \frac{X_r - \sum_{k=1}^{r-1} (T_k' X_r)\, T_k}
           {\left\lVert X_r - \sum_{k=1}^{r-1} (T_k' X_r)\, T_k \right\rVert}
$$

where $r = 2, 3, \ldots, m$ yields the remaining column vectors of $T$.
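
A minimal sketch of this construction is given below. It assumes that the
columns of the cyclic matrix are linearly independent (the Gram-Schmidt
process breaks down otherwise), and the names `goal_rotation`,
`A_rot_demo` and `x_rot_demo` are hypothetical values chosen only for
illustration.

```python
import numpy as np

def goal_rotation(x):
    """Build an orthonormal T whose first column is x / ||x||."""
    x = np.asarray(x, dtype=float)
    m = x.size
    # Row i is x shifted cyclically by i places, so column 0 equals x itself.
    X_star = np.array([np.roll(x, -i) for i in range(m)])
    T = np.zeros((m, m))
    for r in range(m):
        v = X_star[:, r].copy()
        for k in range(r):
            # Remove the component of column r that lies along T_k.
            v -= (T[:, k] @ X_star[:, r]) * T[:, k]
        norm = np.linalg.norm(v)
        if norm < 1e-12:
            raise ValueError("columns of the cyclic matrix are linearly dependent")
        T[:, r] = v / norm
    return T

# Hypothetical loadings: three items and one criterion on m = 3 factors.
A_rot_demo = np.array([[0.7, 0.2, 0.1],
                       [0.3, 0.6, 0.2],
                       [0.1, 0.2, 0.7],
                       [0.5, 0.5, 0.3]])
x_rot_demo = np.array([3.0, 2.0, 1.0])           # hypothetical relative weights
T_rot_demo = goal_rotation(x_rot_demo)
t_rot_demo = x_rot_demo / np.linalg.norm(x_rot_demo)
# Append t as the final row and rotate.
S_rot_demo = np.vstack([A_rot_demo, t_rot_demo]) @ T_rot_demo
```

After rotation the appended row becomes (1, 0, 0), and the first column of
the rotated matrix holds the projection of every item and criterion on the
goal test vector.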

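The $(n + k + 1)$th row vector of $A$ is defined as the vector $t$. The
matrix $T$ may now be used to rotate the matrix $A$ such that the vector
with projections $S_1$ of $S$ is collinear with the column vector $T_1$.
Let

$$
\begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1m} \\
a_{21} & a_{22} & \cdots & a_{2m} \\
\vdots & \vdots &        & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nm} \\
c_{11} & c_{12} & \cdots & c_{1m} \\
c_{21} & c_{22} & \cdots & c_{2m} \\
\vdots & \vdots &        & \vdots \\
c_{k1} & c_{k2} & \cdots & c_{km} \\
t_{11} & t_{21} & \cdots & t_{m1}
\end{bmatrix}
\begin{bmatrix}
t_{11} & t_{12} & \cdots & t_{1m} \\
t_{21} & t_{22} & \cdots & t_{2m} \\
\vdots & \vdots &        & \vdots \\
t_{m1} & t_{m2} & \cdots & t_{mm}
\end{bmatrix}
=
\begin{bmatrix}
s_{11} & s_{12} & \cdots & s_{1m} \\
s_{21} & s_{22} & \cdots & s_{2m} \\
\vdots & \vdots &        & \vdots \\
s_{n1} & s_{n2} & \cdots & s_{nm} \\
\vdots & \vdots &        & \vdots \\
s_{g1} & s_{g2} & \cdots & s_{gm} \\
s_{h1} & s_{h2} & \cdots & s_{hm}
\end{bmatrix}
$$

$$
A\,T = S
$$

where $g = n + k$ and $h = g + 1$. Since $T$ is orthonormal, $s_{h1} = 1.000$ and
$s_{h2}$ through $s_{hm}$ are equal to 0.000.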

It would be desirable that the sums of squares, $SS_r$, defined as

$$
SS_r = \sum_{j=1}^{n} s_{jr}^2 \qquad (r \neq 1)
$$

be such that

$$
SS_r > SS_{r+1} \qquad (r \neq 1).
$$

If this condition were attained the $m - 1$ remaining vectors of $S$ would
account for decreasing importance in redefining the original space
$A_m$. Since the first vector of $S$ represents the object test, in the
same sense the remaining $m - 1$ vectors of $S$ represent 'other tests'
orthogonal to the object test. Since an item does not contribute solely
to a single object test but also to all other tests orthogonal to it
in the space (except in the case of the item being perfectly
orthogonal to one or more $S_r$, $r \neq 1$), the contributions of the item to the
remaining orthogonal tests should also be assessed.

Therefore it is desirable to arrange our transformation such that
the orthogonal tests account for decreasing amounts of the remaining
common item-criterion variance. Thus, we now adjust the $S_2$ through $S_m$
column vectors of $S$ to have decreasing amounts of variance accounted
for by each factor. Let

$$
F = \begin{bmatrix}
f_{11} & f_{12} & \cdots & f_{1p} \\
f_{21} & f_{22} & \cdots & f_{2p} \\
\vdots & \vdots &        & \vdots \\
f_{h1} & f_{h2} & \cdots & f_{hp}
\end{bmatrix}
=
\begin{bmatrix}
s_{12} & s_{13} & \cdots & s_{1m} \\
s_{22} & s_{23} & \cdots & s_{2m} \\
\vdots & \vdots &        & \vdots \\
s_{h2} & s_{h3} & \cdots & s_{hm}
\end{bmatrix}
$$

where $p = m - 1$. Considering again an orthogonal transformation matrix
$E$ such that

$$
(F E)'\,(F E) = \Lambda_{m-1},
$$

it can be seen that the columns of $E$ are the eigenvectors of $F'F$ and
$\Lambda$ holds their associated eigenvalues, i.e.

$$
E' F' F E = \Lambda
$$

$$
F' F = E \Lambda E'
$$


The matrix $E$ will provide a transformation for which the $m - 1$ remaining
vectors of $F$ will be of decreasing importance in terms of accounted
variance. The $m - 1$ remaining vectors may be considered 'concomitant
tests' orthogonal to the goal test, each of decreasing importance.
Multiplying so that

$$
F E = D
$$

will yield sums of squares of decreasing order for the $D$ matrix. The
elements of $D$ are appended to the first column vector of $S$ to produce

$$
[\, S_1 : D \,] =
\left[\begin{array}{c|cccc}
s_{11} & d_{11} & d_{12} & \cdots & d_{1p} \\
s_{21} & d_{21} & d_{22} & \cdots & d_{2p} \\
\vdots & \vdots & \vdots &        & \vdots \\
s_{h1} & d_{h1} & d_{h2} & \cdots & d_{hp}
\end{array}\right]
$$

The object test vector is defined as the row vector

$$
\begin{bmatrix} 1.000 & 0.000 & 0.000 & \cdots & 0.000 \end{bmatrix}
$$

Our task now is to select predictor variables from the $n$ row vectors of

$$
[\, S_1 : D \,] =
\begin{bmatrix}
b_{11} & b_{12} & \cdots & b_{1m} \\
b_{21} & b_{22} & \cdots & b_{2m} \\
\vdots & \vdots &        & \vdots \\
b_{n1} & b_{n2} & \cdots & b_{nm} \\
\vdots & \vdots &        & \vdots \\
b_{h1} & b_{h2} & \cdots & b_{hm}
\end{bmatrix}
$$

such that the centroid of the selected vectors will be nearly collinear
with the object vector.
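
The reordering of the remaining factors can be sketched as an
eigendecomposition of the cross-product matrix of all columns after the
first. This is an illustrative sketch only; the function name
`order_concomitant` and the randomly generated matrix `S_ord_demo` are
hypothetical and do not come from the study.

```python
import numpy as np

def order_concomitant(S):
    """Keep column 0 of S and rotate the rest to decreasing sums of squares."""
    S = np.asarray(S, dtype=float)
    F = S[:, 1:]                              # the m - 1 remaining vectors
    eigvals, E = np.linalg.eigh(F.T @ F)      # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # reorder to descending
    eigvals, E = eigvals[order], E[:, order]
    D = F @ E                                 # column sums of squares = eigenvalues
    B = np.column_stack([S[:, 0], D])
    return B, eigvals

# Hypothetical rotated loading matrix with six rows and four factors.
rng_demo = np.random.default_rng(0)
S_ord_demo = rng_demo.uniform(-0.5, 0.5, size=(6, 4))
B_ord_demo, lam_ord_demo = order_concomitant(S_ord_demo)
ss_ord_demo = np.sum(B_ord_demo[:, 1:] ** 2, axis=0)
```

Since the transformation is orthogonal, the first column and the length of
every row vector are left unchanged; only the allocation of the remaining
variance among the 'concomitant tests' is altered.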

Appendix  A  contains  a  flow  chart  of  the  item-selection 
algorithm. 


Criteria  for  Item  Selection 

To this point the theory indicates that a precise formulation
of a goal test can be specified within the space $A_m$ and in such a manner
as to have psychometric meaning. The selection of items to approximate
the object test is now required.

From the geometry of the space $A_m$ containing $n + k + 1$ vectors,
the following selection procedures appear reasonable:

(a) At any stage of selection the correlation, $r_{gc}$, between
the goal test and the composite vector should be a maximum, where $g$
represents the goal test and $c$ the item composite approximation to $g$.
Since $g$ is of unit length ($h_g^2 = 1.000$) while $h_c^2 \le 1.000$ and in the
empirical case $h_c^2 < 1.000$, and since

$$
r_{gc} = h_g h_c \cos\theta
$$

$$
r_{gc} = h_c \cos\theta
$$

for a given $\theta$ selecting items in terms of decreasing communality will
produce a decreasing $r_{gc}$. Similarly, for a fixed $h_c$ selecting on the
basis of largest to smallest $\cos\theta$ will have the same effect. Selecting
on both $h_c$ and $\theta$ results in the same trend. Thus, by this procedure a
negatively sloped function for $r_{gc}$ is to be expected.

(b) Select all items that have

$$
b_{i1}^2 > \sum_{r=2}^{m} b_{ir}^2
$$

which is equivalent to

$$
(h_i^2 - b_{i1}^2) < b_{i1}^2 \qquad \text{(SC2)}
$$

Since

$$
\cos\theta_{ij} = \frac{r_{ij}}{h_i h_j}
$$

can be reduced to

$$
h_i \cos\theta_{ij} = b_{i1}
$$

because of the unique properties of the object test vector, the above
equation relating communality to the first factor variance can be
written, after division by $h_i^2$, as

$$
(1 - \cos^2\theta_{ij}) < \cos^2\theta_{ij}
$$

which can be transformed to

$$
0.5 < \cos^2\theta_{ij}.
$$

Thus, given the condition

$$
b_{i1}^2 > \sum_{r=2}^{m} b_{ir}^2,
$$

the restriction

$$
0^\circ \le \theta_{ij} < 45^\circ
$$

is imposed.

If an item's $h_i^2$ was very low, even though the above condition was
met, the item would be included. This would not be a fully adequate
criterion since if $h_i^2$ was low the associated item should not remain in
the pool or space $A_m$. The specification of a minimum acceptable value
SC1 of $b_{i1}$ would solve the communality problem presented above.

(c) Termination of the selection method is dependent upon
additional stop criteria specified by the user. Selection of items
will be discontinued when

1. the correlation between the composite vector and the
object vector is less than SC3, or

2. the angular departure of the composite vector
from the object vector is greater than SC4, or

3. there are no further items remaining after meeting
the conditions imposed by SC1 and SC2, or

4. the number of items desired by the test constructor
has been selected.

Each test constructor must provide values for SC3 and SC4 which act as
a "stop criterion" for the item-selection process.
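
The criteria above can be combined in a simplified greedy sketch. It is
not a transcription of the flow chart in Appendix A. It assumes that
eligible items are taken in order of decreasing loading on the goal test,
that the composite is the mean of the selected row vectors, and that an
item which would violate SC3 or SC4 is not added. The function
`select_items`, its parameter names and the array `B_sel_demo` are
hypothetical.

```python
import numpy as np

def select_items(B, sc1=0.300, sc3=0.0, sc4_deg=45.0, max_items=None):
    """Greedy item selection under the SC1 to SC4 rules (sc1 assumed positive)."""
    B = np.asarray(B, dtype=float)
    b1 = B[:, 0]
    rest = np.sum(B[:, 1:] ** 2, axis=1)
    # SC1: minimum loading on the goal test; SC2: 'true' exceeds 'error' variance.
    eligible = np.flatnonzero((b1 >= sc1) & (b1 ** 2 > rest))
    order = eligible[np.argsort(-b1[eligible], kind="stable")]
    chosen = []
    for i in order:
        if max_items is not None and len(chosen) >= max_items:
            break                                   # desired number reached
        trial = chosen + [int(i)]
        c = B[trial].mean(axis=0)                   # composite (centroid) vector
        r_gc = c[0]                                 # correlation with goal test
        cos_theta = np.clip(r_gc / np.linalg.norm(c), -1.0, 1.0)
        angle = np.degrees(np.arccos(cos_theta))
        if r_gc < sc3 or angle > sc4_deg:
            break                                   # SC3 or SC4 violated
        chosen = trial
    return chosen

# Hypothetical rotated loadings for five items on three factors.
B_sel_demo = np.array([[0.8, 0.1, 0.1],
                       [0.7, 0.2, -0.1],
                       [0.5, 0.3, 0.2],
                       [0.4, 0.5, 0.3],
                       [0.2, 0.1, 0.0]])
picked_demo = select_items(B_sel_demo, sc3=0.6, sc4_deg=20.0)
```

In this illustration the fourth item fails SC2 and the fifth fails SC1, so
only the first three items are ever considered; raising SC3 shortens the
selected test further.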




Validity  and  Reliability  Estimates 

The test validity for $n$ selected items can be estimated by
calculating  a  correlation  coefficient  to  determine  the  relationship 
between  the  composite  test  vector  and  the  goal  test  vector.  It  is 
assumed  that  the  goal  test  vector  represents  the  composite  criterion. 
Classical  test  validity  methods  presented  in  the  review  of  the 
literature  on  measurement  theory  are  not  fully  appropriate  in  the 
present  situation  although  the  notion  of  correlating  a  series  of 
predictors  with  a  criterion  score  is  retained.  The  defined  test 
validity  coefficient  most  appropriate  in  relation  to  the  proposed 
method  is  the  correlation  expressing  the  relationship  between  the 
composite  test  vector  and  the  goal  test  vector. 

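A centroid of any n selected items locates the composite test
vector in the item and criterion space. The centroid would have m
co-ordinates

$$\frac{1}{n}\sum_{k=1}^{n} b_{k1},\qquad \frac{1}{n}\sum_{k=1}^{n} b_{k2},\qquad \ldots,\qquad \frac{1}{n}\sum_{k=1}^{n} b_{km}$$

where $b_{kj}$ denotes the loading of the k-th selected item on factor j.
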

By  restricting  the  number  of  factors  to  m,  a  limitation  is  imposed 
upon  reproducing  the  original  correlation  coefficient  between  two 
vectors.  Since  the  validity  coefficient  defined  above  is  based  upon 
the  "goodness  of  fit"  of  one  vector  to  the  location  of  another  vector, 
the  validity  coefficient  may  be  thought  of  as  a  coefficient  of 
reproducibility. The validity coefficient is

$$r_{val} = h_c \cos\theta_{cg}$$




where $\theta_{cg}$ is the angle between the composite test vector and the goal
test vector and $h_c$ is the length of the composite test vector.
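As a hedged illustration of the validity coefficient, the sketch below computes the centroid of a set of selected items, its length, and the cosine of its angle with the goal vector. The function names are hypothetical. The two item vectors are those of items 3 and 4 in the worked example later in this chapter, and the goal vector is assumed to be the unit vector along factor one.

```python
import math

def centroid(items):
    """Mean co-ordinate of the selected item vectors on each factor."""
    n = len(items)
    m = len(items[0])
    return [sum(row[j] for row in items) / n for j in range(m)]

def validity(items, goal):
    """Return (r_val, h_c, cos_theta) for the composite of `items`."""
    c = centroid(items)
    h_c = math.sqrt(sum(x * x for x in c))            # length of composite
    g_len = math.sqrt(sum(x * x for x in goal))
    cos_theta = sum(a * b for a, b in zip(c, goal)) / (h_c * g_len)
    return h_c * cos_theta, h_c, cos_theta

# Items 3 and 4 of the worked example, in the rotated space.
selected = [(0.781, 0.153, -0.270), (0.787, -0.123, 0.151)]
r_val, h_c, cos_theta = validity(selected, goal=(1.0, 0.0, 0.0))
print(round(r_val, 3), round(h_c, 3), round(cos_theta, 3))
```

When the goal vector is the first axis, the product of the length and the cosine reduces to the first co-ordinate of the centroid, which is why the worked example reads the validity directly from that co-ordinate.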

A  relatively  simple  procedure  is  available  for  calculating  a 
test  reliability  coefficient.  Cronbach  (1951)  has  shown  that  one  of 
the  Kuder  and  Richardson  (1937)  formulas  gives  the  mean  of  the 
correlations  resulting  from  all  possible  ways  of  splitting  a  given 
test  into  two  halves  and  that  it  gives  the  proportion  of  first-factor 
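variance extracted from the intercorrelations of the test items. Thus,

$$\frac{1}{n}\sum_{k=1}^{n} c_{k}^{2},$$

where $c_k$ is the loading of the k-th selected item on the first factor,
yields an estimate of test reliability.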

The  internal  consistency  reliability  coefficient  can  be 
considered  from  two  points  of  view.  If  the  projections  of  the  item 
vectors  are  on  the  goal  test  vector,  one  estimate  is  available  regarding 
proportion  of  variance  associated  with  a  given  criterion.  However, 
if  projections  of  item  vectors  are  onto  the  composite  test  vector, 
the  internal  consistency  coefficient  is  then  truly  an  estimate  of  the 
constructed  test's  internal  consistency  variance.  Depending  upon  the 
interpretation  desired,  either  coefficient  would  be  suitable.  Thus, 
it  may  be  advantageous  to  calculate  both  internal  consistency 
reliability  coefficients  and  subsequently  label  each  according  to  the 
vector  representation.  Projections  upon  the  composite  test  vector 
are  easily  found  by  using  the  normalized  centroid  locations  of  the 
composite  test  vector  as  a  transformation  matrix  to  rotate  the  item 
vectors  such  that  the  composite  test  vector  and  factor  one  are  collinear. 




A matrix A of the factor loadings for all n selected items is rotated
by V, a vector composed of the normalized centroid locations of the
composite test vector, to form C, which is a column vector containing
the projections of the item vectors onto the composite test vector.
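The projection step and the two internal consistency estimates can be sketched as follows. This is an illustration under stated assumptions rather than the author's program: the helper names are hypothetical, the reliability estimate is taken to be the mean of the squared projections (the reading that agrees with the figures reported in the worked example), and the loadings are those of the four items selected there.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def project(items, direction):
    """Projection of each item vector onto the unit vector along `direction`."""
    v = normalize(direction)
    return [sum(a * b for a, b in zip(row, v)) for row in items]

def reliability(loadings):
    """Mean squared loading: proportion of variance on the chosen vector."""
    return sum(c * c for c in loadings) / len(loadings)

# Items 1, 3, 4 and 7 of the worked example (rotated loadings).
A = [(0.504, -0.212, -0.060),
     (0.781,  0.153, -0.270),
     (0.787, -0.123,  0.151),
     (0.524,  0.265, -0.167)]
n = len(A)
composite = [sum(row[j] for row in A) / n for j in range(3)]

C = project(A, composite)                 # loadings on the composite vector
r_composite = reliability(C)              # about 0.447 in the example
r_goal = reliability(project(A, (1.0, 0.0, 0.0)))  # about 0.440
print([round(c, 3) for c in C], round(r_composite, 3), round(r_goal, 3))
```

Labelling the two coefficients by the vector used, as the text recommends, keeps the two interpretations distinct.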

If the composite test vector were normalized, a procedure
analogous to the correction for attenuation of a correlation coefficient
would result. The validity coefficient would then be

$$r_{val} = \cos\theta_{cg}$$


Worked  Example  of  the  Selection  Technique 

The  following  is  a  sequential  step  by  step  worked  example  of  the 
proposed  analytical  method  for  selecting  items.  It  is  assumed  that  the 
items  have  been  previously  written  and  administered  to  a  large  sample  of 
subjects.  We  now  start  with  the  factor  pattern  of  the  items  and  criteria 
where

                          0.440   0.320  -0.080     1.
                          0.460   0.190  -0.610     2.
                          0.840   0.030   0.020     3.
                          0.640   0.460   0.190     4.   Items
              A  =        0.560  -0.600   0.340     5.
                          0.550  -0.390   0.220     6.
                          0.590  -0.100   0.120     7.
                          0.450   0.480  -0.700     C1   Criteria
                          0.400  -0.500   0.100     C2


The assigned weights are, respectively,

              5.000
              2.000
              1.000

which is the column vector $X_1$. We now form

                          5.000   2.000   1.000
              X  =        2.000   1.000   5.000
                          1.000   5.000   2.000


and apply the Gram-Schmidt orthonormalization process to X, starting with the
column vector

              0.913
              0.365
              0.183

which is used as a basic reference point to calculate

                          0.913  -0.185  -0.364
              T  =        0.365  -0.030   0.930
                          0.183   0.982  -0.040
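The construction of T can be reproduced with a short Gram-Schmidt routine. The sketch below is an illustration only. It assumes, from the layout of X in the example, that the columns of X are cyclic shifts of the weight vector; the function name `gram_schmidt` and the variable names are hypothetical.

```python
import math

def gram_schmidt(columns):
    """Orthonormalize a list of column vectors, in the order given."""
    basis = []
    for col in columns:
        v = list(col)
        for u in basis:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]   # remove component on u
        length = math.sqrt(sum(a * a for a in v))
        if length < 1e-12:
            raise ValueError("columns are linearly dependent")
        basis.append([a / length for a in v])
    return basis

weights = [5.0, 2.0, 1.0]
m = len(weights)
# Assumed construction: column j of X is the weight vector shifted j places.
X_columns = [weights[j:] + weights[:j] for j in range(m)]
T_columns = gram_schmidt(X_columns)
# Store T by rows so that T[i][j] is the entry in row i, column j.
T = [[T_columns[j][i] for j in range(m)] for i in range(m)]
for row in T:
    print(["%.3f" % x for x in row])
```

Because the first column of X is the weight vector itself, the first column of T is the normalized goal direction, which is what makes factor one collinear with the goal vector after rotation.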


The matrix T is now used to rotate the matrix A (AT = S) such that the
column vector represented by $S_1$ of S is collinear with the column vector $X_1$.

                            A                         T                         S

Items        1.     0.440   0.320  -0.080     0.913  -0.185  -0.364     0.504  -0.169   0.141
             2.     0.460   0.190  -0.610     0.365  -0.030   0.930     0.378  -0.690   0.034
             3.     0.840   0.030   0.020     0.183   0.982  -0.040     0.781  -0.136  -0.279
             4.     0.640   0.460   0.190                               0.787   0.055   0.187
             5.     0.560  -0.600   0.340                               0.354   0.248  -0.776
             6.     0.550  -0.390   0.220                               0.400   0.126  -0.572
             7.     0.590  -0.100   0.120                               0.524   0.012  -0.313
Criteria     C1     0.450   0.480  -0.700                               0.458  -0.785   0.311
             C2     0.400  -0.500   0.100                               0.201   0.039  -0.615
Object Test         0.913   0.365   0.183                               1.000   0.000   0.000


The sums of squares for columns 1, 2 and 3 in matrix S are, respectively,

              3.437   1.221   1.636

A transformation is now carried out to rotate the second and third
columns of S so that the second column will account for the maximum
amount of variance possible in an orthogonal space of the two vectors.
When this has been done, the third column contains the remaining portion
of the variance not accounted for by the second factor.

The rotated matrix D is now appended to the column vector $S_1$
to form B.


                       I       II      III

Items        1.      0.504  -0.212  -0.060
             2.      0.378  -0.418  -0.550
             3.      0.781   0.153  -0.270
             4.      0.787  -0.123   0.151
             5.      0.354   0.780  -0.234
             6.      0.400   0.543  -0.219
             7.      0.524   0.265  -0.167
Criteria     C1      0.458  -0.700  -0.472
             C2      0.201   0.529  -0.315
Goal Test    GV      1.000   0.000   0.000

The respective sums of squares for the above columns are

              3.437   2.003   0.854

which total to the same amount as in S, but we now have each factor
accounting for a decreasing amount of variance. In matrix B we have
seven item vectors, two criteria vectors and a goal vector.
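The rotation of the second and third columns of S can be sketched as a plane rotation through the angle that maximizes the sum of squares of the second column. This is one way to obtain the reported figures, offered as an assumption about the computation rather than a statement of the author's routine; the function name `rotate_last_two` is hypothetical and the matrix S is copied from the example.

```python
import math

S = [(0.504, -0.169,  0.141), (0.378, -0.690,  0.034),
     (0.781, -0.136, -0.279), (0.787,  0.055,  0.187),
     (0.354,  0.248, -0.776), (0.400,  0.126, -0.572),
     (0.524,  0.012, -0.313), (0.458, -0.785,  0.311),
     (0.201,  0.039, -0.615), (1.000,  0.000,  0.000)]

def rotate_last_two(rows):
    """Rotate columns 2 and 3 so that column 2 carries maximum variance."""
    a = sum(r[1] * r[1] for r in rows)        # sum of squares, column 2
    b = sum(r[2] * r[2] for r in rows)        # sum of squares, column 3
    cross = sum(r[1] * r[2] for r in rows)    # sum of cross-products
    # Angle maximizing the sum of squares of (col2*cos - col3*sin).
    phi = 0.5 * math.atan2(-2.0 * cross, a - b)
    c, s = math.cos(phi), math.sin(phi)
    return [(r[0], r[1] * c - r[2] * s, r[1] * s + r[2] * c) for r in rows]

B = rotate_last_two(S)
ss = [sum(r[j] * r[j] for r in B) for j in range(3)]
print([round(x, 3) for x in ss])
```

The first column is left untouched, so the collinearity of factor one with the goal vector is preserved while the remaining variance is redistributed.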




A test constructor must specify various parameters. The values
used in this example are as follows:

SC1 = 0.200, SC2 = $b_{i1}^{2}$ for each respective item vector,
SC3 = 0.30, and SC4 = 45 degrees.

"Error" variance, $e_i$, is defined as

$$e_i = h_i^{2} - b_{i1}^{2}$$

where $h_i^{2}$ is the communality of the centroid vector and $b_{i1}^{2}$ represents
the variance accounted for by the first element of the centroid row
vector. Items 4 and 3 are first selected because they have the largest
$b_{i1}$ values of all items available for selection.


Item 3.                0.781   0.153  -0.270
Item 4.                0.787  -0.123   0.151
Sum of items           1.568   0.030  -0.119
Centroid               0.784   0.015  -0.060
Centroid variance      0.615   0.000   0.004
Communality            0.619
$h_i$                  0.787
cos θ                  0.784 / 0.787 = 0.996;  θ = 4.465 degrees
$e_i$                  0.004


The  correlation  of  the  composite  vector  with  the  goal  vector  is  0.784. 
Since  the  stop  criteria  are  not  applicable  at  this  stage,  we  now 
proceed  to  select  another  item.  Items  2,  5  and  6  are  rejected  because 




in each case $h_i^{2} - b_{i1}^{2}$ is greater than $b_{i1}^{2}$. Thus items 1 and 7 remain
to be selected from the pool.


                                      Adding item 1               Adding item 7

Sum of items previously selected   1.568   0.030  -0.119       1.568   0.030  -0.119
Item 1.                            0.504  -0.212  -0.060
Item 7.                                                        0.524   0.265  -0.167
Sum of items                       2.072  -0.182  -0.179       2.092   0.295  -0.286
Centroid                           0.691  -0.061  -0.060       0.697   0.098  -0.095
$h_i^{2}$                          0.4848                      0.5044
$h_i$                              0.696                       0.710
cos θ                              0.691 / 0.696 = 0.993       0.697 / 0.710 = 0.982

Since the addition of item 1 to the composite vector results in a smaller
angular departure of the composite vector from the goal test vector
than the addition of item 7, item 1 is now selected. The intermediate summary
data are tabulated below.


Sum of items previously selected   1.568   0.030  -0.119
Item 1.                            0.504  -0.212  -0.060
Sum of three items                 2.072  -0.182  -0.179
Centroid                           0.691  -0.061  -0.060
Centroid variance                  0.477   0.004   0.004
Communality                        0.485
$h_i$                              0.696
cos θ                              0.691 / 0.696 = 0.993;  θ = 7.031 degrees
$e_i$                              0.008

The correlation of the composite vector with the goal vector is 0.691.




One  item  remains  available  for  selection  purposes. 


Sum of items previously selected   2.072  -0.182  -0.179
Item 7.                            0.524   0.265  -0.167
Sum of four items                  2.596   0.083  -0.346
Centroid                           0.649   0.021  -0.087
Centroid variance                  0.421   0.000   0.008
Communality                        0.429
$h_i$                              0.655
cos θ                              0.649 / 0.655 = 0.991;  θ = 7.798 degrees
$e_i$                              0.008


The  correlation  of  the  composite  vector  with  the  goal  vector  is  0.649. 

As  there  are  no  items  remaining  that  meet  the  specified  criteria, 
in  terms  of  the  parameters  set  by  the  user,  the  selection  procedure  is 
terminated.
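The whole selection sequence of the worked example can be sketched as a small greedy loop. This is an illustrative reading of the procedure, not the author's program. It assumes that an item is eligible when its loading on factor one is at least SC1 and its variance off factor one does not exceed its variance on factor one, that the two eligible items with the largest loadings start the composite, and that each later step adds the item giving the smallest angular departure. The function names are hypothetical.

```python
import math

# Rotated loadings of the seven items (matrix B of the example).
B = {1: (0.504, -0.212, -0.060), 2: (0.378, -0.418, -0.550),
     3: (0.781,  0.153, -0.270), 4: (0.787, -0.123,  0.151),
     5: (0.354,  0.780, -0.234), 6: (0.400,  0.543, -0.219),
     7: (0.524,  0.265, -0.167)}

def composite_stats(ids, loadings):
    """Centroid, length, and cosine with factor one for the chosen items."""
    n = len(ids)
    m = len(next(iter(loadings.values())))
    c = [sum(loadings[i][j] for i in ids) / n for j in range(m)]
    h = math.sqrt(sum(x * x for x in c))
    return c, h, c[0] / h

def select_items(loadings, sc1, sc3, sc4_degrees, n_desired):
    # Screening: assumed form of the SC1 and SC2 conditions.
    eligible = [i for i, row in loadings.items()
                if row[0] >= sc1
                and sum(x * x for x in row) - row[0] ** 2 <= row[0] ** 2]
    eligible.sort(key=lambda i: loadings[i][0], reverse=True)
    selected, pool = eligible[:2], eligible[2:]
    while pool and len(selected) < n_desired:
        # Candidate whose addition keeps the composite closest to the goal.
        best = max(pool, key=lambda i: composite_stats(selected + [i], loadings)[2])
        c, h, cos_t = composite_stats(selected + [best], loadings)
        angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))
        if c[0] < sc3 or angle > sc4_degrees:
            break                      # SC3 or SC4 would be violated
        selected.append(best)
        pool.remove(best)
    return selected

chosen = select_items(B, sc1=0.200, sc3=0.30, sc4_degrees=45.0, n_desired=10)
final_centroid, final_length, final_cos = composite_stats(chosen, B)
print(chosen, [round(x, 3) for x in final_centroid])
```

With the parameter values of the example, the loop stops because the pool is exhausted, which corresponds to the third termination condition.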

The  test  constructed  from  four  items  selected  in  the  example 
would  have  a  validity  of  0.649.  When  the  composite  test  vector  is 
normalized,  the  test  validity  becomes  0.991.  By  using  the  position 
of  the  centroid  calculated  for  the  constructed  test,  the  factors  can 
be  proportionately  weighted  as  before  when  using  a  transformation  matrix. 
The composite test vector

              [ 0.649   0.021  -0.086 ]

after normalization is

              [ 0.991   0.032  -0.131 ]




Items selected from the matrix of n items

              (1)     0.504  -0.212  -0.060
              (3)     0.781   0.153  -0.270
              (4)     0.787  -0.123   0.151
              (7)     0.524   0.265  -0.167
              (GV)    0.991   0.032  -0.131

which when postmultiplied by the normalized column vector

               0.991
               0.032
              -0.131

yields the factor loadings of each selected vector on the composite
test vector. Loadings on the composite test vector are

              (1)     0.501
              (3)     0.815
              (4)     0.756
              (7)     0.549
              (GV)    1.000


which,  excluding  the  last  element,  has  a  total  variance  of  1.787.  The 
proportion  of  variance  accounted  for  on  the  composite  test  vector  is 
0.447  which  is  the  internal  consistency  reliability.  If  the  internal 
consistency  reliability  is  calculated  using  the  goal  test  vector  as  the 
location  of  the  column  vector,  the  reliability  coefficient  is  0.440. 

It must be remembered that the reliability and validity
coefficients presented above are those defined in relation to the
presented analytical item-selection model. Although the notions of
reliability and validity have been used, the traditional formulae have
not been used because they were not appropriate.



CHAPTER  VII 


EVALUATION  OF  THE  ITEM  SELECTION  METHOD 

The  proposed  algorithm  to  be  used  for  selecting  items  from  a 
pool  of  items  has  as  its  foundation  factor  analytic  theory.  After 
a  test  containing  many  items  has  been  administered  to  a  group  for 
which  there  are,  ideally,  several  criterion  variables  available,  the 
test  items  and  criterion  elements  are  factor  analyzed.  A  rotation  of 
the  resulting  orthogonal  factor  structure  matrix  is  applied  to  provide 
a  final  solution  that  has  simple  structure  properties.  Each  factor 
is  then  assigned  a  relative  weight  by  the  test  user.  From  knowledge  of 
the specified weights, a postulated hypothetical goal vector is constructed
that precisely determines the location in the item and
criterion space of a test having characteristics desired by the user.
The simple structure factor matrix is then rotated to a position where
factor one and the goal vector are collinear.

Initially, the two items having the highest correlations with
the goal vector (the correlation is the same as the factor loading on
factor one) are selected. A centroid of the composite vector is then
calculated for the two item vectors. Additional items are selected
and added to the composite vector, which results in a shifting in the
co-ordinates of the centroid. The objective is to form a composite vector,
composed of the items selected, that will have nearly the same position
in the item and criterion space as the goal vector. Conditions for



termination  of  the  selection  process  were  presented  in  the  previous 
chapter.

Comparison  to  Other  Models 

Common  item  parameters  resulting  from  an  item  analysis  are 

(a)  a  correlation  coefficient  expressing  the  relationship  between  an 
item  and  a  criterion  variable  or  the  total  score  of  the  test,  and 

(b)  an  item's  difficulty  index  from  which  the  variance  can  be 
calculated.  The  information  available  from  an  item  analysis  is  part 
of  the  data  used  in  the  proposed  item  selection  method.  Whereas  the 
item  parameters  are  presented  independently  in  an  item  analysis,  the 
present  system  provides  a  summary  analysis  utilizing  factor  analytic 
theory  to  define  a  common  factor  space  with  the  co-ordinates  of  each 
item  specified.  Common  item  characteristics  are  defined  by  the  use  of 
factors.  Thus,  the  proposed  technique  incorporates  the  data  available 
from  an  item  analysis  and  then  provides  an  objective  solution  for 
determining  which  is  the  "best"  item,  "second  best"  item  and  so  on. 
Clearly, such a procedure is much needed for evaluating an item
in relation to a test that is to be constructed.

Multiple  correlation,  canonical  correlation  and  factor  analysis 
models  are  based  on  essentially  the  same  linear  model  employing  classical 
regression  equations.  It  was  noted  in  reviewing  multiple  correlation 
principles  that  it  is  a  superior  method  to  use  in  test  construction. 

The  desirable  characteristic  of  isolating  each  variable  and  assigning 
a  relative  weight  to  it  for  prediction  purposes,  common  to  multiple 




correlation  and  factor  analysis,  has  been  included  in  the  proposed 
algorithm.  While  much  of  classical  test  theory  has  been  developed 
upon  the  assumption  of  unidimensionality  of  a  test,  the  item-selection 
procedure  presented  here  considers  test  multidimensionality  as  the 
general  case  with  the  unidimensional  test  being  a  special  condition 
derived  from  the  general  model.  Test  reliability  (internal  consistency) 
and  test  validity  can  be  considered  from  a  factor  analytic  point  of 
view  as  was  noted  earlier.  Thus,  the  proposed  algorithm,  in  part,  takes 
into  consideration  and  subsequently  provides  some  evidence  to  the  user 
of  the  relative  estimates  of  test  reliability  and  test  validity 
coefficients.

As in scalogram analysis, reproducibility is also a means of
testing the accuracy of results in factor analysis. The development
of a measure of homogeneity or scalability which will completely
specify the bounds of interrelationships existing among all items of
a scale has been extensively examined by Lingoes (1963). When selecting
items by the writer's technique, the reproducibility of the correlation
coefficient  between  two  variables  may  be  considered  as  an  indication  of 
the  amount  of  "true"  score  variance  or  conversely  the  amount  of  "error" 
variance.  The  variance  of  an  item  not  accounted  for  ("error"  variance) 
on  factor  one,  which  is  collinear  with  the  goal  vector,  will  be  spread 
out  over  the  remaining  orthogonal  factors.  However,  the  error  variance 
is  not  considered  in  reproducing  correlation  coefficients  between  vectors. 
Thus, the calculated correlation coefficients, $\hat{r}$, are estimates of the
"true" correlation coefficient, $r$. Differences between $\hat{r}$ and $r$ may be




positive  or  negative. 

Versatility  of  the  Selection  Algorithm 

In  keeping  with  the  previous  procedures,  each  item  receives  a 
weight  of  1  if  selected  or  a  weight  of  0  if  the  item  is  rejected. 
Although  the  simple  weighting  system  is  used,  in  general  no  restrictions 
are  placed  on  the  type  of  score  that  is  to  be  assigned  to  each  item. 

That is, the method is not restricted to dichotomously scored
items. The item score may be out of 1, 2, 3, or whatever
is desired by the person scoring the test protocols.

Although the proposed method utilizes a factor analytic method,
there is no readily apparent reason why a method other than
principal component analysis cannot be used. The use of various types
of correlation matrices appears to be limited only by the associated
factor analytic method. Several combinations of correlation coefficients
with  methods  of  rotating  factor  matrices  and  types  of  factor  analysis 
provide  many  possible  variations  for  the  application  of  the  algorithm. 

By  describing  the  item  and  criterion  vectors  in  a  factor  space, 
in which a constructed hypothetical goal test may be positioned in an
infinite number of locations, theoretically an infinite number of tests
can be constructed from a single pool of items. The user of such a
technique  is  primarily  restricted  by  the  location  of  the  goal  vector, 
the  available  pool  of  items  and  the  restrictions  or  tolerance  limits 
deemed  necessary.  A  great  deal  of  flexibility  is  available  to  the  user 
which  should  result  in  an  increased  scope  in  test  construction. 




Item  Pools 

If the proposed item-selection method is used, greater effort
than is presently given and more detailed knowledge about test
characteristics will probably be required to establish a pool of items. Each
item  should  be  checked  for  obvious  flaws  and  modified  where  necessary 
prior  to  being  included  in  an  item  pool.  As  a  result  of  an  evaluation 
of  each  item  to  determine  whether  the  item  should  be  added  to  the  pool, 
the  item  pools  should  improve  in  quality.  The  increased  standardization 
of  item  characteristics,  which  defines  the  universe  being  considered, 
provides  information  for  evaluating  any  change  in  composition  of  the 
item pool. Item properties are summarized more fully than is possible
through item analysis alone, while the flexibility of constructing a
test is increased.

Limitations 

The  complexity  of  test  composition  is  magnified  as  the  number 
of  dimensions  (factors)  to  be  weighted  increases.  When  working  with  as 
many  as  10  factors,  few  items  will,  in  the  writer's  experience,  meet 
the  necessary  criteria  for  selection  purposes.  A  matrix  of  2,  3,  or 
4  factors  is  much  easier  to  manipulate.  The  number  of  items  that  can 
be  used  from  a  pool  of  items  to  construct  a  test  will  generally  increase 
as  the  complexity  of  the  factor  space  is  decreased. 

Although  the  proposed  item  selection  method  is  "machine  dependent" 
because  the  amount  of  calculation  involved  necessitates  the  use  of  an 
electronic  computer,  no  real  problem  is  encountered  since  almost  anyone 
who  needs  a  digital  computer  has  access  to  one.  A  point  sometimes 




raised in connection with computers is that no "look" at the data and
intermediate  results  is  possible.  There  is  nothing  to  prevent  the 
computer  from  printing  out  various  intermediate  results,  thus  allowing 
intuition  and  insight  to  be  optional,  but  not  mandatory. 

If a test constructor has access to a computer utilizing time-sharing
features, computer-user interaction can facilitate immediate
evaluation  of  a  set  of  selected  items.  Thus,  after  evaluating  the 
constructed  test,  a  decision  can  be  made  to  accept  the  selected  items 
or  to  vary  the  desired  test  characteristics  and  subsequently  select 
another  set  of  items. 

Insight  and  experience  will  be  required  in  some  cases.  If  the 
items  in  a  pool  defined  two  orthogonal  clusters  of  items  as  illustrated 
in  Figure  3  and  a  goal  vector  was  then  positioned  midway  between  the 
clusters,  the  selected  items  would  have  large  angular  departures  from 
the goal test vector. The large angle between each item and the specified
goal test, with its correspondingly low correlation coefficient, would
indicate that the test constructor should investigate the possibility
that there are no items in the pool that adequately test a particular
domain. A second suggestion is that the item pool may be subdivided
into  two  pools  of  items.  Alternatively,  two  goal  test  vectors  located 
at  the  centroid  of  each  cluster  would  provide  for  the  construction  of 
two  tests. 
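The situation described above can be illustrated numerically. The loadings below are hypothetical values chosen only to form two roughly orthogonal clusters in a two-factor space; they are not data from this study, and the function name `angle_between` is likewise an assumption of the sketch.

```python
import math

def angle_between(u, v):
    """Angle in degrees between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / (nu * nv)))))

# Hypothetical item loadings forming two nearly orthogonal clusters.
cluster_one = [(0.80, 0.05), (0.75, -0.04)]
cluster_two = [(0.03, 0.78), (-0.05, 0.82)]

# A goal vector placed midway between the clusters.
goal_midway = (1.0, 1.0)
midway_angles = [angle_between(item, goal_midway)
                 for item in cluster_one + cluster_two]

# Alternative: one goal vector at the centroid of each cluster.
goal_one = [sum(x) / len(cluster_one) for x in zip(*cluster_one)]
goal_two = [sum(x) / len(cluster_two) for x in zip(*cluster_two)]
own_angles = ([angle_between(item, goal_one) for item in cluster_one] +
              [angle_between(item, goal_two) for item in cluster_two])

print([round(a, 1) for a in midway_angles])
print([round(a, 1) for a in own_angles])
```

With the midway goal every item departs from the goal by roughly 45 degrees, whereas a goal vector at each cluster centroid leaves only a few degrees of departure, which is the case for constructing two tests.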

At  first  glance  one  limitation  appears  to  be  the  necessity  of 
always  being  required  to  have  one  or  more  criterion  variables.  While 
it is desirable to have criterion variables as elements of the correlation
matrix, it is not necessary to include criteria for the purpose of
factor  analyzing  the  correlation  matrix.  The  proposed  algorithm 




Figure 3. Orthogonal Clusters of Items



functions  equally  well  with  or  without  criterion  components.  However, 
if  possible,  criterion  variables  should  be  included. 

One limitation not previously mentioned is in the use of the
Gram-Schmidt orthonormalization process in constructing a transformation
matrix. The use of equal weights for each factor is not acceptable.
With several zero weights, it is not possible to construct an orthonormal
transformation matrix (T). After a series of weights has been
decided upon, a check on the properties of T is made by calculating
TT'. If TT' = I, the weighting system is mathematically acceptable.
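The check on TT' can be sketched as follows. The sketch assumes, as in the worked example of the previous chapter, that the matrix to be orthonormalized is built from cyclic shifts of the weight vector; under that assumption equal weights make every column identical, so no orthonormal T exists. The function names are hypothetical.

```python
import math

def gram_schmidt(columns, tol=1e-10):
    """Return an orthonormal basis, or None if the columns are dependent."""
    basis = []
    for col in columns:
        v = list(col)
        for u in basis:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        length = math.sqrt(sum(a * a for a in v))
        if length < tol:
            return None
        basis.append([a / length for a in v])
    return basis

def transformation_from_weights(weights):
    """Build T from cyclic shifts of the weights (assumed construction)."""
    w = list(weights)
    m = len(w)
    columns = [w[j:] + w[:j] for j in range(m)]
    basis = gram_schmidt(columns)
    if basis is None:
        return None
    return [[basis[j][i] for j in range(m)] for i in range(m)]

def is_orthonormal(T, tol=1e-9):
    """True when TT' equals the identity matrix within tolerance."""
    m = len(T)
    for i in range(m):
        for k in range(m):
            entry = sum(T[i][j] * T[k][j] for j in range(m))  # (TT')[i][k]
            if abs(entry - (1.0 if i == k else 0.0)) > tol:
                return False
    return True

T_ok = transformation_from_weights([5.0, 2.0, 1.0])
T_equal = transformation_from_weights([1.0, 1.0, 1.0])
T_zeros = transformation_from_weights([1.0, 1.0, 0.0, 0.0])
print(T_ok is not None and is_orthonormal(T_ok), T_equal, T_zeros)
```

A weighting system for which the routine returns no matrix, or for which the product fails the identity check, would be rejected before any rotation is attempted.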

Test  Constructor  Involvement 

As a result of providing a more elaborate system for the selection
of items compared to item analysis procedures, more will be
required  of  the  user.  Values  for  several  "stop  criteria"  will  have  to 
be  estimated,  factors  will  have  to  be  labelled  with  appropriate  names, 
factors  must  be  weighted  and  greater  concern  will  have  to  be  devoted 
to  establishing  pools  of  items.  Rather  than  making  the  test  constructor’s 
job  easier,  the  responsibility  for  providing  various  parameters  has 
greatly  increased  the  understanding  required  of  the  user. 

It  was  not  intended  that  the  proposed  algorithm  be  presented 
with  optimal  parameters  and  then  used  in  a  routine  manner.  If  better 
tests  are  to  be  constructed,  more  analytic  procedures  are  required. 
However,  critical  decisions  regarding  acceptable  tolerance  limits  and 
test  characteristics  will  remain  with  the  test  constructor. 




CHAPTER  VIII 


SUMMARY,  CONCLUSIONS  AND  IMPLICATIONS 

The  problem  being  investigated  and  a  proposed  solution  to  the 
problem  are  briefly  outlined.  Empirical  data  are  not,  at  present, 
available  but  conclusions  regarding  the  appropriateness  of  the 
algorithm  are  presented.  Since  the  present  study  has  been  concerned 
with  a  theoretical  model  that  was  not  directly  an  extension  of  previous 
research,  many  theoretical  and  practical  implications  may  be  considered. 

The  Problem  and  a  Proposed  Solution 

Since  many  items  are  available  to  construct  a  test,  test 
constructors  would  like  to  know  which  is  the  "best"  item,  "second  best" 
item  and  so  on  for  predicting  a  set  of  criteria.  The  algorithm 
presented  to  solve  this  problem  is  based  upon  factor  analytic  theory. 
Items  and  criteria  are  factored  to  define  a  common  factor  space.  A 
constructed  hypothetical  goal  vector  is  defined  in  the  factor  space. 

The "best" k items are selected to form a composite vector that is
nearly  collinear  with  the  goal  vector  as  defined  by  the  user. 

The  selection  of  items  is  dependent  upon  the  availability  of 
a  pool  of  items  that  have  been  administered  to  a  group  of  subjects. 
Parameters  must  be  specified  by  the  user  which  will  result  in  a  test 
being  constructed  according  to  specific  desired  characteristics.  The 
item selection method provides considerable flexibility in test
construction.




General  Conclusions 

In  theory,  the  proposed  item  selection  technique  is  directly 
related  to  a  wide  variety  of  test  construction  procedures  such  as 
item  analysis,  regression  analysis  and  factor  analysis.  A  logical 
evaluation  of  the  algorithm  for  the  selection  technique  has  revealed 
no  major  difficulties  regarding  practical  application.  Because  no 
empirical  evidence  is  available,  the  present  conclusions  are  necessarily 
theoretical. When evidence for the use of the proposed method and its
relationship to results from other procedures is available, definite
conclusions  will  be  in  order.  However,  as  it  stands,  the  item  selection 
procedure  appears  to  have  merit  from  a  theoretical  viewpoint. 

Implications 

Although  the  proposed  algorithm  is  composed  of  commonly  used 
procedures,  this  seems  to  be  the  first  time  that  such  a  practical 
application  of  an  item  selection  method  has  been  presented  with  these 
components.  The  theoretical  foundation  does  not  appear  to  violate  the 
basis of measurement theory. Because of the extremely general nature of
the algorithm, many problems are immediately apparent that require
further research.

Theoretical. The notion of a multidimensional space, where
unidimensionality is a special case of the general situation, has been
considered through factor analytic theory. A variation in the procedure
for factoring the items and criteria would be to factor the
criteria variables and then position the item vectors in the criterion
space. Thus, the unique properties of the items in each test would
not  lead  to  variations  in  the  stable  criterion  space.  The  implication 
here  is  that  the  same  criteria  variables  would  be  used  in  comparing 
similar items. The use of a 'common criterion space' would provide
the same marker variables for items from different tests administered
to various groups.
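
The variation described above, in which the criteria are factored first and the item vectors are then positioned in the criterion space, can be sketched as follows. The sketch assumes two orthogonal criterion factors and a least-squares solution A = R F (F'F)^-1, where R holds the item-criterion correlations and F the criterion loadings. This estimator, the function name and the numerical values are illustrative assumptions and are not taken from the thesis.

```python
def position_items(item_criterion_r, criterion_loadings):
    """Least-squares loadings of items on two orthogonal criterion factors.

    item_criterion_r: rows are items, columns are criteria (correlations).
    criterion_loadings: rows are criteria, columns are the two factors.
    Solves A = R F (F'F)^-1, assuming R is approximately A F'.
    """
    F = criterion_loadings
    # Elements of the 2 x 2 matrix F'F.
    s11 = sum(f[0] * f[0] for f in F)
    s12 = sum(f[0] * f[1] for f in F)
    s22 = sum(f[1] * f[1] for f in F)
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("criterion factors are collinear")
    inv = [[s22 / det, -s12 / det], [-s12 / det, s11 / det]]
    positions = []
    for r in item_criterion_r:
        # Row of R F, then multiply by the inverse of F'F.
        rf = [sum(r[j] * F[j][k] for j in range(len(F))) for k in range(2)]
        positions.append(
            [rf[0] * inv[0][k] + rf[1] * inv[1][k] for k in range(2)]
        )
    return positions

# Made-up loadings of three criteria on two factors.
criterion_loadings = [[0.8, 0.1], [0.7, 0.2], [0.1, 0.9]]
# Made-up "true" item positions, used only to generate consistent correlations.
true_positions = [[0.5, 0.3], [0.2, 0.6]]
item_criterion_r = [
    [sum(a[k] * f[k] for k in range(2)) for f in criterion_loadings]
    for a in true_positions
]
recovered = position_items(item_criterion_r, criterion_loadings)
```

Because the criterion loadings are fixed in advance, items from different tests would be located with reference to the same axes, which is the property the paragraph above points to.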

Practical. A much needed technique has been presented. Since
the item selection method is objective and at the same time flexible
to the individual user, many applications should immediately be found
in the routine construction of tests. Although the name "item-selection
method" has been frequently used, the method need not be
restricted only to the selection of items. Any variable, in the form
of an element in a correlation matrix, may be considered with this
technique.

If  items  are  selected  from  a  pool  of  items  that  contains  the 
stem  and  the  alternatives  of  the  question  as  stored  information,  it 
should  be  possible  to  prepare  stencils  for  the  final  test  by  means  of 
a computer and related auxiliary equipment. Such a procedure would be
relatively  simple  to  design. 

Implications  for  Further  Research.  The  algorithm  could  be 
used  to  test  the  effect  of  using  various  types  of  correlation  matrices, 
different  methods  of  factor  analysis  and  variations  in  the  procedures 
used  for  rotation  of  matrices  to  simple  structure.  It  may  be  especially 
interesting  to  apply  the  method  of  alpha  factor  analysis  (Kaiser  and 
Caffrey, 1965) to this algorithm since the correlation coefficients in
an alpha factor analysis are corrected for attenuation during the
factoring  process. 
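
For reference, the classical correction for attenuation divides an observed correlation by the square root of the product of the two reliabilities. The sketch below shows that formula, followed by a rescaling of a correlation matrix in which communalities take the place of reliabilities. Treating the communalities this way is an assumption made here to illustrate the idea; the function names and values are hypothetical.

```python
import math

def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Classical correction: the correlation expected between
    error-free scores, given reliabilities r_xx and r_yy."""
    return r_xy / math.sqrt(r_xx * r_yy)

def rescale_by_communality(R, h2):
    """Rescale off-diagonal correlations by the communalities h2,
    used here in place of reliabilities (illustrative assumption).
    Diagonal entries are set to 1."""
    n = len(R)
    return [
        [1.0 if i == j else R[i][j] / math.sqrt(h2[i] * h2[j])
         for j in range(n)]
        for i in range(n)
    ]

corrected = correct_for_attenuation(0.42, 0.70, 0.60)
rescaled = rescale_by_communality([[1.0, 0.3], [0.3, 1.0]], [0.5, 0.72])
```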

Several  studies  are  required  to  provide  empirical  evidence  for 
the  effectiveness  of  the  item-selection  method  in  practical  test 
construction  settings.  As  well  as  providing  evidence  for  evaluation 
of  the  selection  method,  research  is  needed  to  formally  and  empirically 
compare  the  model  to  other  presently  available  procedures  such  as  that 
proposed by Wherry and Gaylord (1946).

Procedures are required that can be used to update item pools
with additional items. The relative effects of normalizing an item
vector, compared with considering the communality of each item, require
investigation.
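
The difference between the two treatments can be seen in a small sketch. It assumes that the communality of an item is the sum of its squared loadings and that a composite is the vector sum of the item vectors; the helper names and the loadings are hypothetical.

```python
import math

def communality(loadings):
    """Sum of squared loadings of one item."""
    return sum(a * a for a in loadings)

def normalize(loadings):
    """Scale an item vector to unit length."""
    length = math.sqrt(communality(loadings))
    return [a / length for a in loadings] if length else list(loadings)

def composite(vectors):
    """Vector sum of the item vectors."""
    return [sum(col) for col in zip(*vectors)]

def angle_deg(u, v):
    """Angle in degrees between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(communality(u)) * math.sqrt(communality(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

# Two made-up items: one with a high communality, one with a low communality.
items = [[0.8, 0.0], [0.0, 0.2]]
goal = [1.0, 1.0]

raw_angle = angle_deg(composite(items), goal)
unit_angle = angle_deg(composite([normalize(v) for v in items]), goal)
```

With the raw loadings the composite is dominated by the item with the larger communality, whereas with normalized vectors both items contribute equally to the direction of the composite.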

A logical analysis should be carried out to examine the
relevance of the size and the type of parameters used in terminating
the selection procedure. Following this, a statistical evaluation
should be presented to determine the relative importance of the
parameters as indices.


REFERENCES 


Adams, Georgie and Torgerson, T. L. Measurement and evaluation in
education, psychology and guidance. New York: Holt, Rinehart
and Winston, 1964.

American Educational Research Association. Technical recommendations
for achievement tests. Washington, D. C.: American Educational
Research Association, 1955.

American Psychological Association. Technical recommendations for
psychological tests and diagnostic techniques. Washington, D. C.:
American Psychological Association, 1954.

American Psychological Association. Standards for educational and
psychological tests and manuals. Washington, D. C.: American
Psychological Association, 1966.

Anastasi, Anne. Psychological testing. (2nd ed.) New York: Macmillan,
1961.

Astin, A. W. Criterion-centered research. Educational and
Psychological Measurement, 1964, 24, 807 - 822.

Ayers, J. D. Justification of Bloom's taxonomy by factor analysis.
Unpublished manuscript, University of Alberta, 1966.

Baggaley, A. R. Intermediate correlational methods. New York: Wiley,
1964.

Bloom, B. S. (Ed.), Taxonomy of educational objectives, handbook I:
Cognitive domain. New York: Longmans, Green, 1956.

Brogden, H. E. Variation in test validity with variation in the
distribution of item difficulties, number of items, and degree of their
intercorrelations. Psychometrika, 1946, 11, 197 - 214.

Carroll, J. B. An analytical solution for approximating simple structure
in factor analysis. Psychometrika, 1953, 18, 23 - 38.

Cooley, W. W. and Lohnes, P. R. Multivariate procedures for the
behavioral sciences. New York: Wiley, 1962.

Coombs, C. H. The concepts of reliability and homogeneity. Educational
and Psychological Measurement, 1950, 10, 43 - 56.




Cronbach, L. J. Coefficient alpha and the internal structure of tests.
Psychometrika, 1951, 16, 297 - 334.

Cronbach, L. J. Essentials of psychological testing. (2nd ed.) New
York: Harper, 1960.

Cronbach, L. J., Rajaratnam, N. and Gleser, Goldine C. Theory of
generalizability: A liberalization of reliability theory. British
Journal of Statistical Psychology, 1963, 16, 137 - 163.

Cronbach, L. J. and Meehl, P. E. Construct validity in psychological
tests. Psychological Bulletin, 1955, 52, 281 - 302.

Cureton, E. E. Validity, reliability, and baloney. Educational and
Psychological Measurement, 1950, 10, 94 - 96.

Cureton, E. E. Validity. In E. F. Lindquist (Ed.), Educational
measurement. Washington, D. C.: American Council on Education,
1951, pp. 621 - 694.

Davis, F. B. Item selection techniques. In E. F. Lindquist (Ed.),
Educational measurement. Washington, D. C.: American Council
on Education, 1951, pp. 266 - 328.

Douglas, H. and Spencer, P. Is it necessary to weight exercises in
standardized tests? Journal of Educational Psychology, 1923,
14, 109 - 112.

DuBois, P. H. An introduction to psychological statistics. New York:
Harper and Row, 1965.

Ebel, R. L. Measuring educational achievement. New Jersey: Prentice
Hall, 1965.

Elfving, G., Sitgreaves, R., and Solomon, H. Item selection procedures
for item variables with a known factor structure. Psychometrika,
1959, 24, 189 - 205.

Eysenck, H. J. Uses and limitations of factor analysis in psychological
research. In Anne Anastasi (Ed.), Testing Problems in Perspective.
Washington, D. C.: American Council on Education, 1966,
pp. 355 - 359.

Flowers, J. F. The application of electronic digital computers to item
analysis and test development. Paper read at Canadian Association
of Professors of Education, 1965.

Freeman, F. S. Theory and practice of psychological testing. (revised
edition) New York: Holt, 1955.




Fruchter, B. Introduction to factor analysis. Princeton: Van
Nostrand, 1954.

Fruchter, B. and Jennings, E. Factor analysis. In H. Borko (Ed.),
Computer applications in the behavioral sciences. New Jersey:
Prentice Hall, 1962, pp. 238 - 265.

Furst, E. Constructing evaluation instruments. New York: Longmans,
Green and Company, 1958.

Ghiselli, E. E. Theory of psychological measurement. New York:
McGraw-Hill, 1964.

Glass, G. V. Alpha factor analysis of infallible variables.
Psychometrika, 1966, 31, 545 - 561.

Gleser, Goldine and DuBois, P. A successive approximation method
of maximizing test validity. Psychometrika, 1951, 16, 129 - 139.

Green, B. F., Jr. The computer revolution in psychometrics.
Psychometrika, 1966, 31, 437 - 445.

Greene, H. A., Jorgenson, A. N. and Gerberich, J. R. Measurement and
evaluation in the secondary school. New York: Longmans, Green
and Company, 1954.

Guilford, J. P. Psychometric methods. (2nd ed.) New York:
McGraw-Hill, 1954.

Guilford, J. P. Fundamental statistics in psychology and education.
(4th ed.) New York: McGraw-Hill, 1965.

Gulliksen, H. The relation of item difficulty and inter-item correlation
to test variance and reliability. Psychometrika, 1945, 10,
79 - 91.

Gulliksen, H. Theory of mental tests. New York: Wiley, 1950. (a)

Gulliksen, H. Intrinsic validity. American Psychologist, 1950, 5,
511 - 517. (b)

Guttman, L. Image theory for the structure of quantitative variates.
Psychometrika, 1953, 18, 277 - 296.

Guttman, L. A generalized simplex for factor analysis. Psychometrika,
1955, 20, 173 - 192.

Harman, H. H. Modern factor analysis. Chicago: University of Chicago
Press, 1960.




Hays, W. L. Statistics for psychologists. New York: Holt, Rinehart
and Winston, 1963.

Helmstadter, G. C. Principles of psychological measurement. New York:
Appleton-Century-Crofts, 1964.

Hilgard, E. R. A test item file to accompany Hilgard's introduction to
psychology 3rd edition. New York: Harcourt, Brace and World,
1962.

Hoffman, B. The tyranny of testing. New York: Crowell-Collier, 1962.

Hohn, F. E. Elementary matrix algebra. (2nd ed.) New York:
Macmillan, 1964.

Horst, P. Item selection by means of a minimizing function. Psychometrika,
1936, 1, 229 - 244.

Horst, P. Optimal test length for maximum differential prediction.
Psychometrika, 1956, 21, 51 - 66.

Horst, P. Relations among m sets of measures. Psychometrika, 1961,
26, 129 - 149.

Horst, P. Factor analysis of data matrices. New York: Holt, Rinehart
and Winston, 1965.

Horst, P. Psychological measurement and prediction. California:
Wadsworth, 1966.

Horst, P. and MacEwan, Charlotte. Optimal test length for maximum
absolute prediction. Psychometrika, 1956, 21, 111 - 124.

Horst, P. and MacEwan, Charlotte. Optimal test length for multiple
prediction: the general case. Psychometrika, 1957, 22, 311 - 324.

Householder, A. S. and Young, G. Matrix approximations and latent roots.
American Mathematical Monthly, 1938, 45, 165 - 171.

Hoyt, C. Test reliability estimated by analysis of variance.
Psychometrika, 1941, 6, 153 - 160.

Kaiser, H. F. The varimax criterion for analytic rotation in factor
analysis. Psychometrika, 1958, 23, 187 - 200.

Kaiser, H. F. The application of electronic computers to factor
analysis. Educational and Psychological Measurement, 1960, 20,
141 - 151.




Kaiser, H. F. Psychometric approaches to factor analysis. In Anne
Anastasi (Ed.), Testing Problems in Perspective. Washington,
D. C.: American Council on Education, 1966, pp. 360 - 368.

Kaiser, H. F. and Caffrey, J. Alpha factor analysis. Psychometrika,
1965, 30, 1 - 14.

Katzell, R. A. Symposium: The need and means of cross-validation.
III. Cross-validation of item analysis. Educational and
Psychological Measurement, 1951, 11, 16 - 22.

Kelly, T. L. The selection of upper and lower groups for the
validation of test items. Journal of Educational Psychology, 1939,
30, 17 - 24.

Kelly, E. L. Alternate criteria in medical education and their
correlates. In Anne Anastasi (Ed.), Testing Problems in
Perspective. Washington, D. C.: American Council on Education,
1966, pp. 176 - 194.

Kuder, G. F. and Richardson, M. W. The theory of the estimation of test
reliability. Psychometrika, 1937, 2, 95 - 101.

Layton, W. L. The relationship between the method of successive
residuals and the method of exhaustion. Psychometrika, 1951,
16, 51 - 56.

Lennon, R. T. Assumptions underlying the use of content validity.
Educational and Psychological Measurement, 1956, 16, 294 - 304.

Lindquist, E. F. (Ed.) Educational Measurement. Washington, D. C.:
American Council on Education, 1951.

Lingoes, J. C. Multiple-scalogram analysis: A set-theoretic model
for analyzing dichotomous items. Educational and Psychological
Measurement, 1963, 23, 501 - 524.

Loevinger, Jane. The attenuation paradox in test theory. Psychological
Bulletin, 1954, 51, 493 - 504.

Lord, F. M. A theory of test scores. Psychometric Monographs, 1952,
No. 7.

Lord, F. M. Optimum level of item difficulty. Research Memo,
Princeton, N. J.: Educational Testing Service, 1953.

Lord, F. M. Do tests of the same length have the same standard error of
measurement? Educational and Psychological Measurement, 1957, 17,
501 - 521.




Lord, F. M. Tests of the same length do have the same standard error
of measurement. Educational and Psychological Measurement,
1959, 19, 233 - 239.

Lubin, A. and Osburn, H. G. A theory of pattern analysis for the
prediction of a quantitative criterion. Psychometrika, 1957,
22, 63 - 74.

Lumsden, J. The construction of unidimensional tests. Psychological
Bulletin, 1961, 58, 122 - 131.

McNemar, Q. Psychological statistics. (3rd ed.) New York: Wiley,
1962.

Mollenkopf, W. G. Variation of the standard error of measurement.
Psychometrika, 1949, 14, 189 - 229.

Mollenkopf, W. G. Predicted differences and differences between
predictions. Psychometrika, 1950, 15, 409 - 417.

Mosier, C. I. Symposium: The need and means of cross-validation.
I. Problems and designs of cross-validation. Educational
and Psychological Measurement, 1951, 11, 5 - 11.

Neuhaus, J. O. and Wrigley, C. The quartimax method: An analytical
approach to orthogonal simple structure. British Journal of
Statistical Psychology, 1954, 7, 81 - 91.

Novick, M. R. The axioms and principal results of classical test
theory. Journal of Mathematical Psychology, 1966, 3, 1 - 18.

Nunnally, J. C., Jr. Tests and measurements - assessment and prediction.
New York: McGraw-Hill, 1959.

Orleans, J. S. A test item file to accompany Cronbach's educational
psychology second edition. New York: Harcourt, Brace and
World, 1963.

Osburn, G. H. and Lubin, A. The use of configural analysis for the
evaluation of test scoring methods. Psychometrika, 1957, 22,
359 - 372.

Ray, W. S., Hundleby, J. D. and Goldstein, D. A. Test skewness and
kurtosis as functions of item parameters. Psychometrika, 1962,
27, 39 - 47.

Remmers, H. H. and Gage, N. L. Educational measurement and evaluation.
(2nd ed.) New York: Harper, 1955.




Richardson, M. W. The relation between the difficulty and the
differential validity of a test. Psychometrika, 1936, 1, 33 - 49.

Richardson, M. W. and Adkins, D. C. A rapid method of selecting
test items. Journal of Educational Psychology, 1938, 29,
547 - 552.

Rozeboom, W. W. Foundations of the theory of prediction. Homewood,
Ill.: Dorsey, 1966.

Ryans, D. G. Measurement and prediction of teacher effectiveness.
In Anne Anastasi (Ed.), Testing Problems in Perspective.
Washington, D. C.: American Council on Education, 1966, pp. 222 - 237.

Saunders, D. R. A computer program to find the best-fitting
orthogonal factors for a given hypothesis. Psychometrika, 1960,
25, 199 - 205.

Stoker, H. W. and Kropp, R. P. Measurement of cognitive processes.
Journal of Educational Measurement, 1964, 1, 39 - 42.

Swineford, F. Note on tests of the same length do have the same
standard error of measurement. Educational and Psychological
Measurement, 1959, 19, 241 - 242.

Terman, L. M. and Merrill, Maud A. Stanford-Binet intelligence scale:
manual for the third revision. Form L-M. Boston: Houghton
Mifflin, 1960.

Thorndike, R. L. Personnel selection - test and measurement techniques.
New York: Wiley, 1949.

Thorndike, R. L. Reliability. In E. F. Lindquist (Ed.), Educational
measurement. Washington, D. C.: American Council on Education,
1951, pp. 560 - 620.

Thorndike, R. L. and Hagen, Elizabeth. Measurement and evaluation in
education and psychology. New York: Wiley, 1955.

Thurstone, L. L. Multiple factor analysis. Chicago: University of
Chicago, 1947.

Toops, H. A. The L-method. Psychometrika, 1941, 6, 249 - 266.

Traxler, A. E. Administering and scoring the objective test. In
E. F. Lindquist (Ed.), Educational measurement. Washington, D. C.:
American Council on Education, 1951, pp. 329 - 416.




Webster, H. Maximizing test validity by item selection. Psychometrika,
1956, 21, 153 - 164.

Wechsler, D. The measurement and appraisal of adult intelligence.
(4th ed.) Baltimore: Williams and Wilkins, 1958.

Wherry, R. J. and Gaylord, R. H. Test selection with integral gross
score weights. Psychometrika, 1946, 11, 173 - 183.

Wherry, R. J. and Winer, B. J. A method for factoring large numbers
of items. Psychometrika, 1953, 18, 161 - 179.




APPENDIX  A 


ITEM  SELECTION  ALGORITHM 


[Flow chart. Box labels in order of appearance; the connecting arrows
are not recoverable.]

Factor analyze items and criteria.
Assign weights to each factor.
Calculate co-ordinates of the goal vector (GV).
Form a transformation matrix by the Gram-Schmidt method.
Rotate.
Save GV.
Form composite test vector (CT) using the two items with largest a.
Select i-th item to include in CT.
Decision: a(i1) >= .30 and a(i1) > sum of a(ij) for j = 2 to m
    [subscripts partly illegible] (YES / NO)
Select item k and add to CT.
Calculate the angle φ between GV and CT and the length of CT.
Decision: φ < 45° and length of CT > .300 (YES / NO)
Decision: additional items desired (YES / NO)
Rotate m - 1 vectors to maximum variance.
Decision: additional items in the pool of items (YES / NO)
Terminate the selection of items.
Calculate reliability and validity estimates.
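
The step "Form a transformation matrix by the Gram-Schmidt method" can be illustrated with a minimal sketch. It assumes that the goal vector is taken as the first axis and that the remaining axes are obtained by orthogonalizing the original co-ordinate axes in turn. The function names and the numerical values are hypothetical, and the sketch is not the program used in the thesis.

```python
import math

def gram_schmidt_rotation(goal):
    """Return an orthonormal basis (list of rows) whose first row points
    along the goal vector; the other rows complete the basis."""
    m = len(goal)
    if not any(goal):
        raise ValueError("goal vector must be non-zero")
    # Start from the goal vector, then try each original axis in turn.
    candidates = [list(goal)] + [
        [1.0 if i == j else 0.0 for j in range(m)] for i in range(m)
    ]
    basis = []
    for v in candidates:
        w = list(v)
        for b in basis:
            proj = sum(x * y for x, y in zip(w, b))
            w = [x - proj * y for x, y in zip(w, b)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm > 1e-10:  # skip axes that are linearly dependent
            basis.append([x / norm for x in w])
        if len(basis) == m:
            break
    return basis

def rotate(vector, basis):
    """Co-ordinates of a vector on the new axes."""
    return [sum(x * y for x, y in zip(vector, b)) for b in basis]

goal = [2.0, 1.0, 2.0]      # made-up relative weights for three factors
T = gram_schmidt_rotation(goal)
rotated_goal = rotate(goal, T)                   # lies wholly on the first axis
item_on_goal = rotate([0.6, 0.3, 0.1], T)[0]     # projection of one item on GV
```

After this rotation the first co-ordinate of each item is its projection on the goal vector, which is the quantity a selection rule of this kind would compare across items.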




